Novel oligonucleotide arrays and their use for sorting, isolating, sequencing, and manipulating nucleic acids

ABSTRACT

A method of sorting mixtures of nucleic acid strands comprising hybridizing the strands to an array of immobilized oligonucleotides, each of which includes a constant segment adjacent to a variable segment. The constant segment of the immobilized oligonucleotides can be made complementary to the ends of strands obtained by digesting a double-stranded nucleic acid with a restriction enzyme and restoring the restriction sites, thereby permitting the sorting of strands according to their variable sequences adjacent to their constant terminal restored restriction sites.

FIELD OF THE INVENTION

This invention is in the field of sorting, isolating, sequencing, and manipulating nucleic acids.

BACKGROUND OF THE INVENTION

Ordered arrays of oligonucleotides immobilized on a solid support have been proposed for sequencing DNA fragments. It has been recognized that hybridization of a cloned single-stranded DNA fragment to all possible oligonucleotide probes of a given length can identify the corresponding, complementary oligonucleotide segments that are present somewhere in the fragment, and that this information can sometimes be used to determine the DNA sequence. Use of arrays can greatly facilitate the surveying of a DNA fragment's oligonucleotide segments. There are two approaches currently being employed.

In one approach, each oligonucleotide probe is immobilized on a solid support at a different predetermined position, forming an array of oligonucleotides. The array allows one to simultaneously survey all the oligonucleotide segments in a DNA fragment strand. Many copies of the strand are required, of course. Ideally, surveying is carried out under conditions to ensure that only perfectly matched hybrids will form. Oligonucleotide segments present in the strand can be identified by determining those positions in the array where hybridization occurs. The nucleotide sequence of the DNA sometimes can be ascertained by ordering the identified oligonucleotide segments in an overlapping fashion. For every identified oligonucleotide segment, there must be another oligonucleotide segment whose sequence overlaps it by all but one nucleotide. The entire sequence of the DNA strand can be represented by a series of overlapping oligonucleotides, each of equal length, and each located one nucleotide further along the sequence. As long as every overlap is unique, all of the identified oligonucleotides can be assembled into a contiguous sequence block [Bains, W. and Smith, G. (1988). A Novel Method for Nucleic Acid Sequence Determination, J. Theor. Biol. 135, 303-307; Lysov, Yu. P., Florentiev, V. L., Khorlin, A. A., Khrapko, K. R., Shik, V. V. and Mirzabekov, A. D., (1988). Determination of the Nucleotide Sequence of DNA Using Hybridization to Oligonucleotides. A New Method, Doklady Akademii Nauk SSSR 303, 1508-1511]. The practical feasibility of using oligonucleotide arrays for sequencing nucleic acid fragments has been demonstrated in model experiments in which short synthetic DNA strands made of pyrimidines were hybridized to an array containing the 256 possible octapurines [Maskos, U. and Southern, E. M. (1991). Analyzing Nucleic Acids by Hybridization to Arrays of Oligonucleotides: Evaluation of Sequence Analysis, In Genome Mapping and Sequencing (Abstracts of papers presented at the 1991 meeting arranged by M. Olson, C. Cantor and R. Roberts), p. 143, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.].
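The overlap-based ordering described above can be illustrated with a minimal sketch. It assumes an idealized survey (every oligonucleotide segment of length n is detected, with no false positives) and that every overlap is unique. The function names `spectrum` and `assemble` are hypothetical and serve only for illustration.

```python
def spectrum(strand, n):
    """Return the set of all length-n segments present in the strand."""
    return {strand[i:i + n] for i in range(len(strand) - n + 1)}


def assemble(oligos):
    """Chain oligonucleotides that overlap by all but one nucleotide.

    Returns the assembled sequence, or None when the chain cannot be
    followed unambiguously (no unique start, or a non-unique overlap).
    """
    oligos = set(oligos)
    suffixes = {o[1:] for o in oligos}
    # The first oligonucleotide is the one that no other oligo leads into.
    starts = [o for o in oligos if o[:-1] not in suffixes]
    if len(starts) != 1:
        return None
    current = starts[0]
    sequence = current
    used = {current}
    while len(used) < len(oligos):
        candidates = [o for o in oligos
                      if o not in used and o[:-1] == current[1:]]
        if len(candidates) != 1:
            return None  # overlap is missing or not unique
        current = candidates[0]
        used.add(current)
        sequence += current[-1]  # each step adds one nucleotide
    return sequence


if __name__ == "__main__":
    strand = "ACGTTGCA"
    print(assemble(spectrum(strand, 3)))
```

In this toy case the six trinucleotides of the strand chain into a single sequence block. A real survey would also have to tolerate detection errors, which this sketch ignores.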

An attractive feature of sequencing by oligonucleotide hybridization is its suitability for being automated. Another attractive feature is its tolerance of detection errors. There is an inherent redundancy in the data, due to the overlapping nature of the oligonucleotides. In contradistinction, current prevalent sequencing methods are based on the reading of sequences one nucleotide at a time, and it is common to overlook a legitimate nucleotide or to insert an illegitimate nucleotide. There is, however, an important limitation to sequencing by known surveying techniques. As relatively longer DNA strands are surveyed, there is an increasing probability that more than two identified oligonucleotides will share the same overlapping sequence, i.e., the overlap is not unique. When this occurs, the sequence of the DNA cannot be unambiguously determined. Instead of one contiguous sequence block that contains the entire DNA sequence, the oligonucleotides can only be assembled into a number of smaller sequence blocks, whose order is not known. Lysov et al. have estimated that, if oligonucleotide probes 8 nucleotides in length are used, then at least 20 percent of all random sequences merely 200 nucleotides in length cannot be assembled into a single sequence block, because of the presence of non-unique overlaps. The longer the DNA sequence, the worse this problem becomes. Khrapko et al. suggested that the ambiguities in reconstruction of a DNA sequence caused by the presence of non-unique overlaps between surveyed oligonucleotides could be resolved by a secondary hybridization of the DNA-oligonucleotide complexes to a series of short oligonucleotides, so that the two hybrids would stack on each other, thus producing a longer duplex [Khrapko, K. R., Lysov, Yu. P., Khorlin, A. A., Shik, V. V., Florentiev, V. L. and Mirzabekov, A. D. (1989). An Oligonucleotide Hybridization Approach to DNA Sequencing, FEBS Lett. 256, 118-122].
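The non-unique overlap problem can be made concrete with a small sketch. It assumes the same idealized survey as the text (all segments of length n detected). It reports every overlap of length n-1 that more than one oligonucleotide begins with, or more than one ends with. The helper names are hypothetical.

```python
from collections import Counter


def spectrum(strand, n):
    """Return the set of all length-n segments present in the strand."""
    return {strand[i:i + n] for i in range(len(strand) - n + 1)}


def non_unique_overlaps(oligos):
    """Return the overlaps (length n-1) shared by more than two oligos.

    An overlap is unique when exactly one oligo ends with it and exactly
    one begins with it; any extra oligo on either side makes the order
    of the sequence blocks ambiguous.
    """
    prefix_counts = Counter(o[:-1] for o in oligos)
    suffix_counts = Counter(o[1:] for o in oligos)
    overlaps = set(prefix_counts) | set(suffix_counts)
    return {ov for ov in overlaps
            if prefix_counts[ov] > 1 or suffix_counts[ov] > 1}


if __name__ == "__main__":
    # "ACG" occurs twice, so the overlap "CG" leads to both CGT and CGA.
    print(non_unique_overlaps(spectrum("ACGTACGA", 3)))
    # No repeated overlap: the strand assembles into one block.
    print(non_unique_overlaps(spectrum("ACGTTGCA", 3)))
```

Applying such a check to random sequences of increasing length would show the ambiguity becoming more frequent, which is the trend the estimate by Lysov et al. quantifies.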

Another way of using arrays for DNA sequencing has been proposed by Drmanac et al. In their method, many different cloned DNA strands are each bound to a solid support at a different position. All are then tested in parallel for their ability to form a hybrid with each of the possible oligonucleotides of a given length. One oligonucleotide at a time is tested. To resolve ambiguities arising because of the presence of non-unique overlaps between the oligonucleotides revealed in a DNA strand, it has been suggested that a library of densely overlapping cloned fragments be prepared and analyzed. The library would be composed of approximately 500-nucleotide-long DNA strands with a 40-nucleotide average displacement [Drmanac, R., Labat, I., Brukner, I. and Crkvenjakov, R. (1989). Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method, Genomics 4, 114-128]. The feasibility of this method has also been demonstrated [Strezoska, Z., Paunesku, T., Radosavljevic, D., Labat, I., Drmanac, R. and Crkvenjakov, R. (1991). DNA Sequencing by Hybridization: 100 Bases Read by a Non-gel Method, Proc. Natl. Acad. Sci. U.S.A. 88, 10089-10093].

The sequencing techniques described above, as well as conventional sequencing techniques, rely on cloning the fragments to be sequenced. Cloning of DNA fragments is well known. For cloning, DNA fragments are ligated into cloning vectors (e.g., plasmids or bacteriophage DNAs), which are then introduced by means of transformation into microbial cells, where they are amplified. At appropriate ratios of fragment-to-vector and vector-to-cell, there will be only one fragment ligated into a vector molecule, and only one recombinant molecule introduced into each transformed cell. By obtaining progeny from individual transformed cells (clones), individual DNA fragments can be isolated. If a large DNA (e.g., a genome) were to be sequenced, it first would be cleaved into pieces of suitable size by, for example, digestion with a restriction endonuclease. The goal of the cloning procedure, in this case, is to obtain a comprehensive library of cloned fragments, which, taken together, comprise every segment of the DNA to be sequenced. However, the completion of a clone library is essentially an asymptotic process. Because fragment cloning is intrinsically random, the number of clones that have to be isolated and analyzed is much greater than the number of different restriction fragments produced by digestion of the original DNA [Sambrook, J., Fritsch, E. F. and Maniatis, T. (1989). Molecular Cloning: A Laboratory Manual; 2nd edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.]. Moreover, there is no way to know whether the library is comprehensive or not, until the sequenced fragments are finally assembled. The cloning of fragments of an entire genome is extremely slow and tedious.
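The asymptotic character of library completion can be illustrated with a back-of-the-envelope model. This sketch assumes, purely for illustration, that every isolated clone is equally likely to carry any one of the distinct fragments (the classical coupon-collector model); real cloning efficiencies are not uniform, so the figure is only indicative. The function name `expected_clones` is hypothetical.

```python
def expected_clones(num_fragments):
    """Expected number of randomly picked clones needed to see every fragment.

    Coupon-collector expectation under the uniform-sampling assumption:
    N * (1 + 1/2 + ... + 1/N).
    """
    harmonic = sum(1.0 / k for k in range(1, num_fragments + 1))
    return num_fragments * harmonic


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        ratio = expected_clones(n) / n
        print(f"{n} fragments: about {ratio:.1f} clones per fragment")
```

Under this assumption the required oversampling grows roughly with the logarithm of the number of fragments, which is consistent with the statement that far more clones than fragments must be analyzed.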

Recently, in place of classic cloning techniques, individual DNA fragments have been amplified by the polymerase chain reaction (PCR). Briefly, this method is based on the hybridization of two oligodeoxynucleotide probes (primers) to DNA strands and the extension of these primers by incubation with DNA polymerase. The primers are intended to hybridize to unique locations within complementary strands of the same DNA molecule, and their growing 3′ termini are directed towards each other, so that their extension results in the replication of the DNA region included between them. The DNA template and product strands are then melted apart at elevated temperature to allow the next round of replication, where both the product strand and the template strand serve as templates for additional replication. This process is repeated many times by cycling between the annealing and melting temperatures, resulting in exponential amplification of the target region [see for example, Mullis et al., U.S. Pat. Nos. 4,800,159 and 4,965,188], incorporated by reference herein. The advantage of PCR over cloning is that fragment isolation becomes deterministic, instead of being random. However, in order to use PCR for preparing DNA fragments, two unique oligonucleotide primers must be synthesized for every new fragment that is amplified. Moreover, the terminal sequences of each fragment must be known in advance. This latter circumstance makes PCR, in its current form, barely useful for the preparation of individual fragments of unknown nucleotide sequences.

SUMMARY OF THE INVENTION

We have invented new oligonucleotide arrays and methods of using them.

A binary array according to the invention contains immobilized oligonucleotides comprised of two sequence segments of predetermined length, one variable and the other constant. The constant segment is the same in every oligonucleotide of the array. The variable segments can vary both in sequence and length. Binary arrays have advantages compared with ordinary arrays: (1) they can be used to sort strands according to their terminal sequences, so that each strand binds to a fixed location (an address) within the array; (2) longer oligonucleotides can be used on an array of a given size, thereby increasing the selectivity of hybridization; this allows strands to be sorted according to the identity of internal oligonucleotide segments adjacent to a particular constant sequence (such as a segment adjacent to a recognition site for a particular restriction endonuclease), and this allows strands to be surveyed for the presence of signature oligonucleotides that contain a constant segment in addition to a variable segment; (3) universal sequences, such as priming sites, can be introduced into the termini of sorted strands using the binary arrays, thereby enabling the strands' specific amplification without synthesizing primers specific for each strand, and without knowledge of each strand's terminal sequences; and (4) the specificity of hybridization during surveying can be increased by coupling hybridization to a ligation event that discriminates against terminal basepair mismatches.

A sectioned array as used herein is an array that is divided into sections, so that every individual area is mechanically separated from all other areas, such as, for example, a depression on the surface, or a “well”. The areas have different oligonucleotides immobilized thereon. A sectioned array allows many reactions to be performed simultaneously, both on the surface of the solid support and in solution, without mixing the products of different reactions. The reactions occurring in different wells are highly specific, the specificity of the reaction occurring in each well being determined by the nucleotide sequence of the oligonucleotide immobilized on the surface. This allows a large number of sortings and manipulations of nucleic acids to be carried out in parallel, by amplifying or modifying only those nucleic acids in each well that are perfectly hybridized to the immobilized oligonucleotides. Nucleic acids prepared on a sectioned array can be transferred to other arrays (replicated) by direct blotting of the wells' contents (printing), without mixing the contents of different wells of the same array. Furthermore, the presence of individual sections in arrays allows multiple re-hybridizations of bound nucleic acids to be performed, resulting in a significant increase in hybridization specificity. It is particularly advantageous according to this invention to use a binary array that is sectioned.

An important feature of arrays which determines their use in the methods described herein is the way oligonucleotides are attached to their surfaces. For many applications we prefer arrays in which the 3′ end of each immobilized oligonucleotide is free, enabling it to be extended by incubation with a DNA polymerase, utilizing a strand hybridized to the oligonucleotide as a template. This provides: (1) a further increase in hybridization specificity, because hybrid extension by DNA polymerase is highly sensitive to terminal mismatches; (2) the ability to obtain strand copies (complementary to the hybridized strands) covalently linked to the array surface, which allows the arrays to be vigorously washed to remove non-covalently bound material, and allows the arrays to serve as permanent banks of sorted nucleic acid strands; and (3) the ability to generate partial copies of hybridized strands by extending the immobilized oligonucleotide after it has bound to an internal segment of the hybridized strand.

Our invention includes methods of using sectioned arrays to sort mixtures of nucleic acid strands, either RNA or DNA. As used herein, “strand” means not just a single strand, but multiple copies thereof; and “mixture of strands” means a mixture of copies of different strands no matter how many copies of each is present. Similarly “fragment” refers to multiple copies thereof, and “mixture of fragments” means a mixture of copies of different fragments. The methods include sorting nucleic acid strands either according to their terminal oligonucleotide segments (3′-terminal or 5′-terminal), or according to their internal oligonucleotide segments on a binary array. Before or after sorting, universal priming region(s) can be added to the strands' termini to enable their subsequent amplification. Binary sectioned arrays for sorting according to strands' terminal sequences (“terminal sequence sorting arrays”) can be “comprehensive”. A comprehensive array is one wherein any possible strand will hybridize to at least one immobilized oligonucleotide. This type of sorting is particularly useful for preparing comprehensive libraries of fragments of a large genome. For example, in one embodiment of the invention, strands of restriction fragments have their restriction sites restored and are sorted on a binary array. That array contains immobilized oligonucleotides whose constant segments contain the sequence complementary to the restriction site, and an adjacent variable segment. The array is complete, containing all variable sequences of each type in separate areas.

Our invention also includes methods of using sectioned arrays, preferably binary, for isolating individual strands (or pairs of allelic strands in the case of a diploid genome). If the starting material is a complex mixture of strands, such as resulting from a restriction digest of an entire human genome, the isolation is performed in two stages. In the first stage, the strands are sorted into groups according to the identity of their terminal sequences, and then amplified to produce direct and/or complementary copies of the bound strands. In the second stage, isolation of individual strands is achieved by sorting the strand copies in each area of the first array on a second array according to their terminal sequences. If the strands were sorted according to their 3′ sequences on the first array, the direct copies are sorted by their 5′ terminal sequences, or the complementary copies are sorted by their 3′ terminal sequences. There are also embodiments wherein individual strands can be obtained by sorting strands according to their internal sequences.

Our invention also includes using sectioned arrays for preparing every possible partial copy of a strand or a group of strands. The term “partial” refers to multiple copies thereof. Partials are prepared by either of the following methods: (1) terminal sorting on a binary sectioned array of a mixture of all possible partial strands generated by random degradation of a parental strand; or (2) generation of partials directly on an array, through the sorting on an ordinary sectioned array of parental strands according to the identity of their internal oligonucleotide sequences, followed by the synthesis of partial copies of each parental strand by enzymatic extension of the immobilized oligonucleotides on the array utilizing the hybridized parental strands as templates. In either case, the partials that are generated correspond to a parental strand whose 3′ or 5′ end is truncated to all possible extents (at the “variable” end of the partial), and whose other end is preserved (at the “fixed” end of the partial). These are “one-sided partials.” Unless otherwise indicated the word “partial” is used herein to refer to one-sided partials. Our invention also includes the preparation of “two-sided partials” that correspond to a parental strand that is truncated to any extent from both ends using our procedures for preparing one-sided partials. These are prepared in a two-stage procedure, each stage resulting in the truncation of one of the ends. If a parental strand has been truncated at its 3′ end in the first stage (its 5′ end being fixed), the resulting partials are truncated in the second stage from the other side by either truncation of direct copies at their 5′ ends (their 3′ ends being fixed), or by truncation of complementary copies at their 3′ ends (their 5′ ends being fixed).

Our invention also includes using sectioned arrays to isolate individual partials from one parental strand or from a group of parental strands.

Our invention also includes methods of using sectioned arrays for carrying out recombinations between chosen segments of previously sequenced nucleic acids. The recombination can be performed on an array in a massively parallel and precisely directed procedure. The recombinants can be constructed from isolated strands or their partials, from mixtures of strands, or from mixtures of their partials.

Our invention also includes methods of using sectioned arrays for the massively parallel introduction of site-directed mutations into sequenced nucleic acids, including the introduction of nucleotide substitutions, deletions, and insertions, using isolated partials, or mixtures of partials. In particular, a single array can be used in one procedure, either to alter many single positions in a gene, or to introduce alterations in many genes. Sectioned arrays can also be subsequently used for the massively parallel testing of the biological effects of the introduced mutations.

Our invention also includes methods of using oligonucleotide arrays for obtaining oligonucleotide information as part of a process for determining the nucleotide sequence of a long nucleic acid strand, or of many nucleic acid strands in an unknown mixture. A complete set of one-sided partials of the strand or strands is prepared on a sectioned array, and the oligonucleotide content of the partial strands in each well of the array is separately surveyed (i.e., each group of partials sharing the same oligonucleotide at the partials' variable end is surveyed). Once the oligonucleotide information is obtained, we infer “address sets”. Each address set is a complete list of all oligonucleotides that are contained in the parental strand, or strands, sharing a particular oligonucleotide. We then decompose the address sets into their constituent “strand sets”, which are complete lists of all of the oligonucleotides that are contained in each parental strand. To arrive at the oligonucleotide sequence of the starting strand(s), the order of oligonucleotides in each strand is then inferred by analyzing the distribution of the oligonucleotides between the “upstream subset” (i.e., 5′) and “downstream subset” (i.e., 3′) of the relevant address sets.

Our invention also includes methods of using oligonucleotide arrays for ordering previously sequenced fragments from a first restriction digest of a large nucleic acid or even a genome. This involves sorting a second (alternate) restriction digest of long DNA into groups of strands on a sectioned array, preferably on a sectioned binary array, and preferably by the oligonucleotides adjacent to the first or second restriction site. Then the sorted strands in each well are amplified (preferably by symmetric PCR). Then two surveys of the strands in each well are carried out with binary arrays to identify “signature oligonucleotides” that are present in the strands of each well of the sorting array. A signature oligonucleotide is a variable sequence and an adjacent restriction recognition sequence, using the first restriction recognition sequence for one survey and the second restriction recognition sequence for the other survey. It is then determined which of all pairwise combinations of signature oligonucleotides found in each well correspond to the signatures of the “intersite segments” that actually occur among the sequenced fragments. An “intersite segment” is a segment of a DNA fragment between two closest restriction recognition sites of either type. Thus, an intersite segment always has two signature oligonucleotides, of either type, and they are always located at its termini. The pair of signature oligonucleotides is an “intersite segment signature” or, for short, a “signature”. Then we determine which combinations of two or more intersite segments accompany each other in two or more different wells of the sorting array by determining which combination of two or more signatures accompany each other in two or more different wells of the sorting array. Then the sequenced fragments from the first digest are ordered according to which intersite segments accompany one another. Repetition of the process with further digests may be needed to accomplish the ordering of all sequenced fragments.

Our invention also includes methods of using oligonucleotide arrays for allocating sequenced and ordered allelic fragments into their chromosomal linkage groups. These methods include the preparation on a sectioned array of selected one-sided partials from selected fragments of an alternate digest. The selected partials span allelic differences in neighboring allelic pairs of sequenced fragments. Oligonucleotides in the selected partials which contain the allelic differences are surveyed, and the fragments thereby allocated.

Our invention also includes a method of using binary arrays for surveying the oligonucleotides contained in nucleic acid strands or their partials. This method provides improved comprehensive surveys over the conventional surveying of oligonucleotides on an ordinary array. The method is especially useful for strand sequencing, for allocating allelic fragments to their chromosomes, as well as for surveys of selected oligonucleotides in, for example, a clinical diagnostic procedure. The method can also be performed to survey special types of oligonucleotides, for example, surveying signature oligonucleotides to order sequenced fragments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a binary array.

FIG. 1 a shows an oligonucleotide immobilized in an area of a binary array.

FIG. 2 shows a sectioned array having depressions.

FIG. 2 a shows a well of a sectioned array.

FIG. 3 shows addition of a lattice to a support to make a sectioned array.

FIG. 4 shows an example of sorting and amplification of restriction fragments on a sectioned binary array.

FIG. 5 shows an example of preparing partials on a sectioned ordinary array.

FIG. 6 shows, schematically, the order of steps for sequencing a complete genome.

FIG. 7 shows, schematically, the use of a sheet with a number of miniature survey arrays for simultaneously surveying every well in a partialing array.

FIG. 8 shows, schematically, how a downstream subset and an address set are inferred from the oligonucleotide content of all possible partials of a strand.

FIG. 9 shows a complete set of partials generated for a nucleic acid strand, and the assembly of one of its address sets from the oligonucleotide information obtained from those partials.

FIG. 9 a shows how the oligonucleotides that are in a strand set can be assembled into sequence blocks.

FIG. 10 shows, schematically, how the information obtained from indexed address sets can be used to determine the order of sequence blocks.

FIG. 11 shows, schematically, how unindexed address sets can be inferred from a survey of the oligonucleotides that are present in the partials generated at different addresses from a mixture of strands.

FIG. 12 shows, schematically, the decomposition of address sets into their constituent strand sets.

FIG. 13 shows, schematically, the decomposition of a pseudo-prime address set into its constituent strand sets.

FIG. 14 shows, schematically, principles used to identify neighboring restriction fragments which have been sequenced.

FIG. 15 shows, schematically, the prediction of segment signatures and their locations within a sorting array.

FIG. 16 shows the ordering of fragments from the distribution of their signatures within a sorting array.

FIG. 17 shows the linking of fragments in neighboring allelic pairs.

FIGS. 18 to 28 show examples of the determination of nucleotide sequences from indexed address sets obtained from analysis of mixtures of strands.

DETAILED DESCRIPTION OF THE INVENTION

Throughout the detailed description, references to the examples section are made to illustrate particular embodiments of the aspect of the invention discussed. Also, techniques described with respect to one embodiment may not be explicitly described in other embodiments. Their application to the several embodiments described herein, however, is understood.

All periodicals, patents and other references cited herein are hereby incorporated by reference.

I. Oligonucleotide Arrays

As used herein an “oligonucleotide array” is an array of regularly situated areas on a solid support wherein different oligonucleotides are immobilized, typically by covalent linkage. Each area contains a different oligonucleotide, and the location within the array of each oligonucleotide is predetermined. If the array is made of oligodeoxyribonucleotides, the nucleotides are: deoxyadenylate (dA), deoxycytidylate (dC), deoxyguanylate (dG), and deoxythymidylate (dT) (for brevity, the prefix “d” is often omitted herein). If the array is made of oligoribonucleotides, the nucleotides are: adenylate (A), cytidylate (C), guanylate (G), and uridylate (U). The array can also contain mixed oligonucleotides, comprised of both ribonucleotides and deoxyribonucleotides, and can include non-standard bases (such as inosine) or modified bases. The oligonucleotides can also possess modified ribose groups or modified phosphate groups (such as occur in the nucleoside phosphorothioates). During hybridization, C pairs with G, and A pairs with T (or U), irrespective of the nature of the sugar moiety. These basepairs are perfect “matches”. All other pairwise combinations are “mismatches”.

Arrays can be classified by the composition of their immobilized oligonucleotides. “Ordinary arrays” known in the art are arrays made of oligonucleotides that are comprised entirely of “variable segments”. Every position of the oligonucleotide sequence in such a segment can be occupied by any one of the four commonly occurring nucleotides.

Comprehensive ordinary arrays are those wherein any segment of any possible strand will hybridize perfectly to the length of one or more of the immobilized oligonucleotides in the array. Therefore, any possible strand can be hybridized to one or more of the immobilized oligonucleotides, so that no strand is lost. An example of a comprehensive ordinary array is one having oligonucleotides all of the same length n, in which case the number of different oligonucleotides is 4ⁿ, each oligonucleotide being situated in a predetermined position; e.g., if the immobilized oligonucleotides are eight nucleotides long, the number of different areas required to make the array comprehensive is 4⁸ or 65,536. Another example of a functionally equivalent comprehensive ordinary array is one having oligonucleotides not all of the same length n, e.g., where a given oligonucleotide of length n is replaced by four oligonucleotides of length n+1 or sixteen oligonucleotides of length n+2 (as a concrete example, the eight-nucleotide oligonucleotide ACGTTGGG could be replaced by four nine-nucleotide oligonucleotides ACGTTGGGA, ACGTTGGGC, ACGTTGGGG and ACGTTGGGT). As used herein, “a comprehensive array of oligonucleotides of length n” refers to both of the above types. If the lengths of the oligonucleotides in a comprehensive array are not the same, the length n is the “basic length”. In such an array, perfectly matched hybrids of the different lengths can be formed using methods described herein.
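The counting in this paragraph can be checked with a short sketch. It enumerates every oligonucleotide of length n and shows how one oligonucleotide of length n is replaced by four of length n+1 or sixteen of length n+2. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

BASES = "ACGT"


def comprehensive_array(n):
    """List every possible oligonucleotide of length n (4**n entries)."""
    return ["".join(p) for p in product(BASES, repeat=n)]


def extend(oligo, extra=1):
    """Replace one oligo by all oligos that extend it by `extra` nucleotides."""
    return [oligo + "".join(p) for p in product(BASES, repeat=extra)]


if __name__ == "__main__":
    print(len(comprehensive_array(8)))   # number of areas for n = 8
    print(extend("ACGTTGGG"))            # four nine-nucleotide replacements
    print(len(extend("ACGTTGGG", 2)))    # sixteen ten-nucleotide replacements
```

Because the replacements cover every possible continuation of the original oligonucleotide, an array containing them in place of the original remains comprehensive.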

As a functional equivalent to an array with immobilized oligonucleotides of variable length as discussed above, the length of all of the immobilized oligonucleotides can be made the same by including degenerate positions at the free ends of shorter oligonucleotides in the array. It can be easier to discriminate against mismatched hybrids if the immobilized oligonucleotides are all of the same length. For example, a shorter immobilized oligonucleotide can be replaced with four oligonucleotides having “A”, “T”, “G” and “C” separately added at its free terminus. All four oligonucleotides should be immobilized in the same area of the array. Two degenerate positions can be added at an immobilized oligonucleotide's terminus resulting in sixteen oligonucleotides immobilized in the same area.

“Binary arrays” according to this invention contain immobilized oligonucleotides that are comprised of two segments, one of which is variable, and the other of which is constant. The same sequence is present in the constant segment of all such oligonucleotides in the binary array. The variable segments can vary in both the sequence and the length. A binary array is illustrated in FIGS. 1 and 1 a. FIG. 1 shows a substrate or support 1 having immobilized thereon an array of oligonucleotides 3, each oligonucleotide being in a separate area 2 of support 1. FIG. 1 a shows one area 2. A binary oligonucleotide 3 comprised of constant region 5 and variable region 6 is covalently bound to support 1 by covalent linking moiety 4. Of course, many identical oligonucleotides are immobilized to the same area.

The number of different oligonucleotides as well as the number of areas is the same for a comprehensive binary array as for a comprehensive ordinary array having the same set of variable segments. In a comprehensive binary array, every possible segment adjacent to the chosen segment in a strand can be hybridized perfectly to one or more variable segments of the immobilized oligonucleotides, the chosen segment being complementary to all or part of the constant segment of the immobilized oligonucleotide. Such an array has the property that if a strand possesses the chosen segment it will be hybridized somewhere in the array.
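The addressing property of a comprehensive binary array can be sketched as follows. The example assumes, for illustration only, that the chosen segment lies at the 3′ terminus of each strand (here the EcoRI recognition sequence GAATTC, chosen arbitrarily), that variable segments have a fixed length n, and that only perfect hybrids form. The function names and the strand sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def immobilized_oligo(variable_segment, chosen_segment):
    """Oligo (5' to 3') that pairs perfectly with variable + chosen segment."""
    return reverse_complement(variable_segment + chosen_segment)


def sort_strands(strands, chosen_segment, n):
    """Group strands by the n nucleotides adjacent to a 3'-terminal chosen segment.

    Keys are the immobilized oligonucleotides (the array addresses);
    strands lacking the terminal chosen segment do not bind anywhere.
    """
    wells = {}
    site_len = len(chosen_segment)
    for strand in strands:
        if len(strand) < n + site_len or not strand.endswith(chosen_segment):
            continue
        variable = strand[-(site_len + n):-site_len]
        address = immobilized_oligo(variable, chosen_segment)
        wells.setdefault(address, []).append(strand)
    return wells


if __name__ == "__main__":
    mixture = ["TTACGGAATTC", "GGACGGAATTC", "CCTTTGAATTC", "ACGT"]
    for address, bound in sort_strands(mixture, "GAATTC", 3).items():
        print(address, bound)
```

In this sketch the constant segment sits upstream of the variable segment in the immobilized oligonucleotide, so a comprehensive array with n = 3 needs only 4³ = 64 areas even though each immobilized oligonucleotide is nine nucleotides long.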

It is possible, of course, to include on the same support additional areas having other oligonucleotides, for example, oligonucleotides not having a constant region.

Because of the constant segments in the immobilized oligonucleotides, binary arrays provide means for the hybridization of longer sequences without increasing the size of the array. The constant segment can be located within the immobilized oligonucleotide either “upstream” of the variable segment (i.e., toward or at the 5′ end of the oligonucleotide) or “downstream” from the variable segment (i.e., toward or at the 3′ end of the oligonucleotide). The type of array that is chosen depends on the specific application to which the array is put. The constant region preferably is or includes a good priming region for amplification of hybridized strands by PCR, or a promoter for copying the strand by transcription. Generally a length of 15 to 25 nucleotides is suitable for priming. The constant region can contain all or part of the complement of a restriction site. A binary array can be “plain” or “sectioned” (see below).

“Plain arrays” known in the art are arrays in which the individual oligonucleotide areas are not physically separated from one another. Many reactions can be carried out simultaneously on a plain array; however, they are limited to those in which the nucleic acid templates and the reaction products are bound in some manner to the surface of the array to avoid the intermixing of products generated in different areas.

“Sectioned arrays” are oligonucleotide arrays that are divided into sections, so that each area is physically separated by mechanical or other means (e.g., a gel) from all the other areas, e.g., by a depression on the surface, called a “well”. There are many techniques apparent to one skilled in the art for preventing the exchange of materials between areas; any such method can be used to make a “sectioned” array, as that term is used herein, even though there might not be a physical wall between areas. For example, the contents of the areas can be prevented from mixing by solidifying or gelling the solution.

One type of sectioned array is illustrated in FIGS. 2 and 2 a. FIG. 2 shows a support sheet 60 having an array of depressions or wells 62, each containing an immobilized oligonucleotide 64. FIG. 2 a shows one well 62 of the array of FIG. 2. Well 62 formed in support 60 has therein oligonucleotide 64 covalently bound to support 60 by covalent linking moiety 66. Of course, many identical oligonucleotides are bound to the surface of each well. In practice one may prepare a plain array, e.g., an array on a flat sheet, and then, at a point during a series of steps involving its use, convert the array into a sectioned array, e.g., by making physical depressions in a deformable solid support to isolate the individual areas in each depression. The sectioned array can also be created by applying a lattice to the solid support and bonding it to the surface so that each area is surrounded by impermeable walls. The technique of application of the lattice to the support is not critical; such means are well known in the art and include using adhesives and heat bonding. The areas of the array should be separated in a water tight manner. An exploded perspective view of such a sectioned array is shown in FIG. 3. Support or substrate 70, here a planar sheet, has mounted thereon and affixed thereto a lattice 72 comprised of a series of horizontal members 74, 76. The lattice members define a series of open areas which, in conjunction with support 70, define an array of wells 78. In some applications it is preferable to utilize a detachable lattice (or a removable cover sheet), so that the sectioned array can be converted back to a plain array. Oligonucleotides can be immobilized on the inner surface of the walls of the lattice, rather than on the bottoms of the wells. Irrespective of whether an array is sectioned permanently or temporarily, it is called herein a sectioned array. It is anticipated that the intermixing of the contents of an array can even be prevented by simply withdrawing materials by means of suction from each area as they are produced. A sectioned array allows reactions to be performed simultaneously in individual areas, both on the molecules attached to the surface of the array and on the molecules contained in the solution in each well. For some applications, it is particularly advantageous to use an array that is both sectioned and contains binary oligonucleotides, i.e., “sectioned binary arrays.”

Sectioned arrays according to this invention can be used to increase the specificity of hybridization of nucleic acids to the immobilized oligonucleotides. After hybridization, unhybridized strands can be washed away. Hybridized strands can then be released into solution without mixing materials present in different wells. Released strands can be rebound to the oligonucleotides immobilized on the surface, and unhybridized strands can be washed away. Each successive release, rebinding, and washing increases the ratio of perfectly matched hybrids to mismatched hybrids.

“Replica arrays” are sectioned arrays that are used to receive nucleic acids from the wells of a first array, such as by printing or blotting. The replica array can contain immobilized oligonucleotides arranged in such a manner that the replica array is a mirror image of the original array, or the replica array can be a blank array. A blank array, unlike “arrays” as used elsewhere herein, does not contain immobilized oligonucleotides. Its surface can be modified by, for example, weak anion-exchange groups (such as diethylaminoethyl groups) to keep the transferred nucleic acids in place. A replica array can initially be a flat sheet, and after the transfer a lattice can be applied to the sheet, to produce a sectioned array. To make the transfer more accurate, the buffer filling the original array can contain a low-gelling-temperature agarose. This buffer remains liquid at the higher temperatures that are required for strand amplification, but a gel forms when the array is chilled. In this case, a cover sheet plus a lattice can serve as a replica array. The cover sheet is first bonded to the lattice that forms the wells of the original array. After the agarose is converted to a gel by chilling, the original array is detached from the lattice and replaced by a new sheet.

An array can be “3′” or “5′”. “3′ arrays” possess free 3′ termini and “5′ arrays” possess free 5′ termini. The immobilized oligonucleotides in the arrays can be used for hybridization or ligation to nucleic acid strands present in solution as part of certain methods of the invention. The immobilized oligonucleotides in a 3′ array can be extended at their 3′ termini by incubation with a nucleic acid polymerase. If the nucleic acid polymerase is a template-directed polymerase, only immobilized oligonucleotides that are hybridized to a nucleic acid template strand can be extended. The immobilized oligonucleotides in a 5′ array cannot be so extended because of the nature of currently known polymerases.

It is of course possible to add to the array, if desired, areas containing oligonucleotides having the same sequence as those in another area.

It is not necessary that all oligonucleotides immobilized in each area have the same sequence. For example, an array containing oligonucleotides might contain in an area the oligonucleotides (constant or variable) “AAAAAAA”, “AAAAAAT”, “AAAAAAG” and “AAAAAAC”. Such a collection of oligonucleotides would be capable of hybridizing to the hexameric sequence “AAAAAA” in addition to any other nucleotide at its terminus. Such an increase in the length of the hybrids effectively results in the same strands being hybridized in that area, and increases the length of the oligonucleotides, possibly allowing the hybrid to be formed at a more convenient temperature. The added nucleotide can be, for example, at the free end or at the immobilized end of the oligonucleotide.

It is also not necessary that all of the constant regions be the same in all of the areas of the array. An array might be broken up into several regions, each utilizing a different constant region.

It is also possible to add additional sequences to the constant and variable segments in a binary array. For example, it is possible to make a trinary or quaternary array according to the invention, in which the immobilized oligonucleotides in those arrays contain a constant segment and a variable segment in addition to further segments which are variable or constant.

In some applications, it may be advantageous to use a comprehensive array obtained by combining oligonucleotides in several areas into one area. This array will retain the property of a comprehensive array that any possible strand segment is able to be hybridized somewhere in the array, although the number of areas in such an array will be smaller. For example, rather than having four oligonucleotides that differ in one position and are immobilized in four separate areas of a comprehensive array, it may be convenient to immobilize all four of these oligonucleotides in one area. Thus, instead of having the sequences “AAAAAAA”, “AAATAAA”, “AAAGAAA”, and “AAACAAA” in separate areas, a comprehensive array might be obtained if they are contained in the same area. This would be analogous to having in this area an oligonucleotide with one position that is degenerate.
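The pooling of oligonucleotides that differ at one position can be illustrated with a short Python sketch. The function name and the use of “N” to label the degenerate position are conventions assumed here for illustration; they do not come from the description above.

```python
from collections import defaultdict
from itertools import product

def pool_by_position(oligos, position):
    """Group oligonucleotides that differ only at `position` into one area.

    Each area is labeled by the shared sequence with "N" written at the
    degenerate position (a labeling convention assumed for this sketch).
    """
    areas = defaultdict(list)
    for oligo in oligos:
        label = oligo[:position] + "N" + oligo[position + 1:]
        areas[label].append(oligo)
    return dict(areas)

# The four heptamers named in the text end up in a single area.
pooled = pool_by_position(["AAAAAAA", "AAATAAA", "AAAGAAA", "AAACAAA"], 3)
print(pooled)  # {'AAANAAA': ['AAAAAAA', 'AAATAAA', 'AAAGAAA', 'AAACAAA']}

# Pooling every 4-mer at one position cuts the number of areas fourfold.
all_tetramers = ["".join(b) for b in product("ACGT", repeat=4)]
print(len(pool_by_position(all_tetramers, 1)))  # 64 areas instead of 256
```

Every original oligonucleotide is still present somewhere in the pooled array, so the comprehensive property is retained while the number of areas drops by a factor of four for each pooled position.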

The length of the immobilized oligonucleotides on the arrays used according to the invention depends upon many considerations discussed herein. One consideration is the ability to discriminate perfectly matched hybrids from mismatched hybrids. In an ordinary array, the length of the immobilized oligonucleotides should be between about six and about thirty nucleotides. In a binary array where the entire length of the immobilized oligonucleotide is intended to hybridize to a strand, the immobilized oligonucleotides should also be between six and thirty nucleotides long. If, however, only part of the oligonucleotide immobilized in a binary array is intended to hybridize to a strand, such as where the immobilized oligonucleotide is pre-hybridized to a masking oligonucleotide, then the length of the region intended to hybridize to the strand should preferably be between six and thirty nucleotides long, i.e., the immobilized oligonucleotide can be longer. This can be achieved by having the length of the constant segment be no longer than one nucleotide, in combination with a longer variable segment, or vice versa.

Suitable substrates or supports for arrays should be non-reactive with reagents to be used in processing, washable under stringent conditions, not interfere with hybridization, and not be subject to inordinate non-specific binding. They must be amenable to covalent linking of oligonucleotides. In many cases it is preferred that the supports be long lasting and not subject to deterioration. Suitable support materials are well known. They include, for example, treated glass, polymers of various kinds (e.g., polyamide and polyacrylmorpholide), latex-coated substrates, and silica chips.

There are a number of different ways to manufacture oligonucleotide arrays. Many methods for the immobilization of oligonucleotides on different solid supports are known in the art, examples of which follow. The support can be made of glass and the surface can be coated with long aminoalkyl chains [Ghosh, S. S. and Musso, G. F. (1987). Covalent Attachment of Oligonucleotides to a Solid Support, Nucleic Acids Res. 15, 5353-5372]. The support can be a polyacrylamide layer [Khrapko, K. R., Lysov, Yu. P., Khorlin, A. A., Shik, V. V., Florentiev, V. L., and Mirzabekov, A. D. (1989). An Oligonucleotide Hybridization Approach to DNA Sequencing, FEBS Lett. 256, 118-122], or a latex-covered surface [Kremsky, J. N., Wooters, J. L., Dougherty, J. P., Meyers, R. E., Collins, M. and Brown, E. L. (1987). Immobilization of DNA via Oligonucleotides Containing an Aldehyde or Carboxylic Acid Group at the 5′ Terminus, Nucleic Acids Res. 15, 2891-2909], or a surface covered with various polymers [Markham, A. F., Edge, M. D., Atkinson, T. C., Greene, A. R., Heathcliffe, G. R., Newton, C. R. and Scanlon, D. (1980). Solid Phase Phosphotriester Synthesis of Large Oligoribonucleotides on a Polyamide Support, Nucleic Acids Res. 8, 5193-5205; Norris, K. E., Norris, F. and Brunfeldt, K. (1980). Solid Phase Synthesis of Oligonucleotides on a Crosslinked Polyacrylmorpholide Support, Nucleic Acids Symp. Ser. 7, 233-241; Zhang, Y., Coyne, M. Y., Will, S. G., Levenson, C. H. and Kawasaki, E. S. (1991). Single-base Mutational Analysis of Cancer and Genetic Diseases Using Membrane Bound Modified Oligonucleotides, Nucleic Acids Res. 19, 3929-3933].

Methods of oligodeoxyribonucleotide synthesis directly on a solid support are also known in the art, including methods wherein synthesis occurs in the 3′ to 5′ direction (so that the oligonucleotides will possess free 5′ termini) [Caruthers, M. H., Barone, A. D., Beaucage, S. L., Dodds, D. R., Fisher, E. F., McBride, L. J., Matteucci, M., Stabinski, Z. and Tang, J.-Y. (1987). Chemical Synthesis of Deoxyoligonucleotides by the Phosphoramidite Method, Methods Enzymol. 154, 287-313; Horvath, S. J., Firca, J. R., Hunkapiller, T., Hunkapiller, M. W. and Hood, L. (1987). An Automated DNA Synthesizer Employing Deoxynucleoside 3′-phosphoramidites, Methods Enzymol. 154, 314-326], and methods wherein synthesis occurs in the 5′ to 3′ direction (so that the oligonucleotides will possess free 3′ termini) [Agarwal, K. L., Yamazaki, A., Cashion, P. J. and Khorana, H. G. (1972). Chemical Synthesis of Polynucleotides, Angew. Chem. 11, 451-459; Belagaje, R. and Brush, C. K. (1982). Polymer Supported Synthesis of Oligonucleotides by a Phosphotriester Method, Nucleic Acids Res. 10, 6295-6303; Rosenthal, A., Cech, D., Veiko, V. P., Orezkaja, T. S., Kuprijanova, E. A. and Shabarova, Z. A. (1983). Triester Solid Phase Synthesis of Oligodeoxyribonucleotides on a Polystyrene-teflon Support, Tetrahedron Lett. 24, 1691-1694; Barone, A. D., Tang, J.-Y. and Caruthers, M. H. (1984). In situ Activation of Bis-dialkylaminophosphines—A New Method for Synthesizing Deoxyoligonucleotides on Polymer Supports, Nucleic Acids Res. 12, 4051-4061].

Methods for synthesizing oligoribonucleotides, and methods for synthesizing mixed oligo(ribo/deoxyribo)nucleotides, on a solid support are also known in the art [Veniaminova, A. G., Gorn, V. V., Zenkova, M. A., Komarova, N. I. and Repkova, M. N. (1990). Automated H-Phosphonate Synthesis of Oligoribonucleotides Using 2′-O-tetrahydropyranyl Protective Groups, Bioorg. Khim. (Moscow) 16, 941-950; Romanova, E. A., Oretskaia, T. S., Sukhomlinov, V. V., Krynetskaia, N. F., Metelev, V. G. and Shabarova, Z. A. (1990). Hybridase Cleavage of RNA. II. Automatic Synthesis of Mixed Oligonucleotide Probes, Bioorg. Khim. (Moscow) 16, 1348-1354; Scaringe, S. A., Francklyn, C. and Usman, N. (1990). Chemical Synthesis of Biologically Active Oligoribonucleotides Using β-Cyanoethyl Protected Ribonucleoside Phosphoramidites, Nucleic Acids Res. 18, 5433-5441].

The simultaneous synthesis of many different oligonucleotides is also known in the art [Frank, R., Meyerhans, A., Schwellnus, K. and Blocker, H. (1987). Simultaneous Synthesis and Biological Applications of DNA Fragments: An Efficient and Complete Methodology, Methods Enzymol. 154, 221-249; Djurhuus, H. W., Staub, A. and Chambon, P. (1987). The Segmented Paper Method: DNA Synthesis and Mutagenesis by Rapid Microscale “Shotgun Gene Synthesis”, Methods Enzymol. 154, 250-287].

Arrays are suitable for automated delivery of four different nucleotide precursors to precise locations within the array using a computer-controlled device similar to the devices used in multicolor inkjet printers (such as the DeskWriter C, manufactured by Hewlett-Packard), based on “drop-on-demand” technology. This method is particularly useful for the synthesis of oligonucleotides on arrays that are already sectioned. An even higher efficiency of oligonucleotide synthesis and a higher density of areas of immobilized oligonucleotides can be achieved by using photolithography techniques [Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. and Solas, D. (1991). Light-directed, Spatially Addressable Parallel Chemical Synthesis, Science 251, 767-773].

Arrays can be made over a wide range of sizes. In the example of a square sheet, the length of a side can vary from a few millimeters to several meters. An array of 256-by-256 areas on 2 mm centers, for example, would be more than a half meter on a side. Miniaturized arrays for surveying, manufactured by using microchip technology, would be orders of magnitude smaller.
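The size estimate above follows from simple arithmetic, sketched here in Python. The function name is hypothetical, and the observation that 256 × 256 equals the number of possible octanucleotides is an arithmetic aside, not a statement from the text about what such an array would carry.

```python
def array_side_mm(areas_per_side, pitch_mm):
    """Side length of a square array, taking each area to occupy one pitch."""
    return areas_per_side * pitch_mm

side_mm = array_side_mm(256, 2.0)
print(side_mm)         # 512.0 mm, i.e. just over half a meter
print(256 * 256)       # 65536 areas in total
print(4 ** 8)          # 65536, the number of distinct 8-nucleotide sequences
```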

There are many useful ways in which the elements of an array can be arranged. The most efficient shape for an array can depend on the particular design of any robotic array-handling device used, any method used to control temperature across the array (see below), and any method used to detect hybrids. The individual areas to which the oligonucleotides are immobilized can be arranged on a surface in various patterns, such as, for example, a square, rectangular, linear, concentric, or spiral pattern. The arrays may be rigid or flexible. For example, they may even be in the form of a tape that is wound up on a reel or cassette.

Sophisticated arrangements can be used in order to place the different oligonucleotides at positions that correspond to the stability (T_(m)) of the hybrids they form. Such an arrangement can be used to increase the specificity of hybridization to the array. For example, an array can be mounted on a plate constructed of a heat-conducting material, such as metal, whose opposite edges are kept at different controlled temperatures (for example, the side along one edge can be heated and the other cooled). These can be the opposite edges of a square, a rectangle, a cylinder, or the inner and outer edges of a disk with a hole in the middle. Moreover, the temperature gradient need not be uniform. The shape of the array or the thickness of the supporting material can be varied in order to alter the distribution of heat through the supporting material. The oligonucleotides should then be arranged on the support in such a manner that each area can be conveniently incubated at whatever temperature is optimal for a preselected operation: hybridization, washing, or a subsequent enzymatic reaction such as ligation or polymerization. Careful placement of the oligonucleotides within the array can ensure that the highest degree of discrimination against mismatched hybrids occurs. The optimal temperature for the formation of each perfect hybrid in the array can be determined in preliminary experiments, in which a mixture of all possible synthetic oligonucleotides, or a digest of nucleic acids of known sequence, is hybridized and then washed away at steadily increasing temperature, while simultaneously recording for each type of oligonucleotide the temperature at which its hybrid dissociates (i.e., a “melting curve” for each oligonucleotide can be established) [Khrapko et al., 1989].
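As a rough illustration of ordering oligonucleotides along a temperature gradient, the following Python sketch ranks sequences by an estimated melting temperature. The estimate used here is the common rule of thumb of 2 °C per A or T and 4 °C per G or C, which is swapped in purely for illustration; the text above calls for experimentally determined melting curves, and real placement would rely on those measurements. All names in the sketch are hypothetical.

```python
def rule_of_thumb_tm(oligo):
    """Approximate melting temperature in degrees C (2 per A/T, 4 per G/C).

    This is a coarse estimate for short oligonucleotides, used only as a
    stand-in for measured melting curves.
    """
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def arrange_by_tm(oligos):
    """Order oligonucleotides from the cool edge to the warm edge."""
    return sorted(oligos, key=lambda o: (rule_of_thumb_tm(o), o))

oligos = ["GGGCCC", "AAATTT", "ACGTAC", "GCGCAT"]
ordered = arrange_by_tm(oligos)
print(ordered)  # ['AAATTT', 'ACGTAC', 'GCGCAT', 'GGGCCC']
print([rule_of_thumb_tm(o) for o in ordered])  # [12, 18, 20, 24]
```

Sorting by the estimate places A:T-rich oligonucleotides toward the cooler edge and G:C-rich ones toward the warmer edge, which is the general arrangement the paragraph describes.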

There are a number of ways that solutions may be spread across large arrays, including sectioned arrays. For example, an array can be rolled on a rotating horizontally mounted cylinder that is slightly immersed in a tray filled with a solution, for example, a nucleic acid mixture. During hybridization or washing, the solution in the tray can be kept hot so that the nucleic acids will denature, and the cylinder can be cooled by having the opposite edges of the cylinder be at different temperatures, thus forming a temperature gradient across the surface of the cylinder. The array can also be placed against the inside wall of a rotating vertically mounted cylinder, such as a centrifuge, whose bottom and top are kept at different temperatures to form a temperature gradient. The thin film of solution contacting the array surface can continuously be withdrawn from the top and be pumped back into the bottom, with, for example, the aid of a peristaltic pump, through a heating coil, in order to ensure that the nucleic acids in the solution remain denatured. The progress of hybridization can be monitored with a densitometer that records the decrease in ultraviolet absorption in the solution being recirculated. The array can also be mounted on a rotating disk, with the liquid being collected at the outer edge and then reintroduced at the center.

II. Sorting Nucleic Acids

Our invention allows mixtures of nucleic acid strands, whether DNA or RNA, to be sorted according either to their terminal oligonucleotide segments (“terminal sorting”) or their internal oligonucleotide segments (“internal sorting”) on a binary array.

There are two important aspects of our invention for sorting nucleic acids. First, each strand in a mixture can be made to hybridize to an array at only a few, or a single, location. And second, each strand can be provided with universal terminal priming regions that enable all strands to be amplified by PCR without prior knowledge of the nucleotide sequences at the strands' termini, and without the need to synthesize individual primers for each strand.

For terminal sorting, the priming region(s) can be made essentially dissimilar from the sequences occurring in the nucleic acids that are present in the mixture to be sorted, so that priming does not occur anywhere but at the strands' termini (the addition of priming regions to strands is discussed below). The absence of priming within a strand's internal regions can be confirmed by checking the inability of the primers chosen to hybridize to the strand mixture at temperatures well below (e.g., by 10° C.) the temperature at which the polymerization reaction is carried out. When strands from a complete restriction digest of a DNA are to be sorted by their termini and then amplified, priming only at the strands' termini can be promoted by restoring the terminal restriction sites (those sites having been eliminated from internal regions by complete digestion) concomitant with the generation of terminal priming regions (see Example 1.1, below). Restriction sites are thereby uniquely found within the sequence of the terminal priming regions.

Universal terminal priming regions (that preferably include a restored restriction site) serve as “tags” that distinguish the terminal oligonucleotide segments from all internal segments. Terminal sorting is carried out on a binary array, which preferably is a sectioned binary array. The immobilized oligonucleotides contain a constant segment complementary to either the strands' 3′ priming region or 5′ priming region. Thus, each strand can only be hybridized to one location within the array. By sorting on a comprehensive array, every strand is bound somewhere within the array. This is especially important for the preparation of a comprehensive library of fragments of a long nucleic acid or a genome.
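The one-strand, one-location property of terminal sorting can be sketched computationally. The Python example below is an idealized model that assumes perfect hybridization only: each strand carrying a given 3′ priming region is assigned to the area whose immobilized oligonucleotide is the reverse complement of the priming region plus the adjacent variable segment. The priming region "GGAATTC", the strands, and all function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def sort_by_terminus(strands, priming_region, variable_length):
    """Assign each strand to the single area that matches its 3' terminus.

    The area is keyed by the immobilized oligonucleotide (5'->3'): the
    complement of the priming region (constant segment, upstream) followed
    by the complement of the adjacent strand segment (variable segment).
    Strands lacking the priming region at their 3' end stay unbound.
    """
    areas = defaultdict(list)
    unbound = []
    for strand in strands:
        adjacent = strand[:-len(priming_region)][-variable_length:]
        if strand.endswith(priming_region) and len(adjacent) == variable_length:
            oligo = reverse_complement(adjacent + priming_region)
            areas[oligo].append(strand)
        else:
            unbound.append(strand)
    return dict(areas), unbound

strands = ["TTTACGGGAATTC", "CCCACGGGAATTC", "TTTTTAGGAATTC", "ACGTACGT"]
areas, unbound = sort_by_terminus(strands, "GGAATTC", 3)
print(areas)
# {'GAATTCCCGT': ['TTTACGGGAATTC', 'CCCACGGGAATTC'], 'GAATTCCTAA': ['TTTTTAGGAATTC']}
print(unbound)  # ['ACGTACGT']
```

Strands sharing the same segment next to the priming region land in the same area, while a strand without the terminal tag is not bound anywhere, mirroring the role of the priming region as a tag.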

The 3′ and 5′ terminal priming regions can be introduced before or after strand sorting. Also, one priming region can be introduced before sorting and another can be introduced after sorting (see Example 1.2, below). Methods of introducing the priming regions include ligation to oligonucleotide adaptors using either DNA ligase or RNA ligase, strand extension with a homopolymeric tail using terminal nucleotide transferases, and combinations of these methods (see Examples 1.1 to 1.3, below).

Strands can be sorted on either 3′ or 5′ arrays in which the constant segment is located either upstream or downstream of the variable segment. High specificity of sorting can be achieved by employing 3′ arrays in which the constant segment of the immobilized oligonucleotides is located upstream from the variable segment. In that case, sorting can be followed by the generation of an immobilized copy of each sorted strand using the immobilized oligonucleotides as primers for the synthesis of a complementary copy of that strand when the array is incubated with an appropriate DNA polymerase. This procedure provides an increase in hybridization specificity, since hybrid extension by DNA polymerase is highly sensitive to terminal mismatches. A functionally equivalent array is a 5′ array in which the constant segment is located downstream from the variable segment. In that case, a primer hybridized to the 3′ end of the bound strand can be extended with a polymerase and the product ligated to the 5′ end of the immobilized oligonucleotide. In both of these two cases the generation of nucleic acid copies that are covalently linked to the array surface enables the arrays to be vigorously washed to remove non-covalently bound material before strand amplification. It also enables the arrays to serve as permanent banks of sorted nucleic acid strands which can subsequently be amplified over and over to generate copies for further use. Exemplary methods are given below in Examples 1.1 to 1.3.

A strand sorting procedure of the invention is illustrated in FIG. 4. A DNA sample 10 is completely digested with a restriction endonuclease. The ends of each fragment are restored, and universal priming sequences 17 generated in the process to prepare fragments 11 for sorting. It is not necessary that priming sequences be added at both ends, if only linear amplification is desired. Nor is it necessary that the priming sequence at the 3′ end of a strand be the same as the priming sequence at the 5′ end.

The strands are then melted apart 12 and hybridized to a terminal sequence binary sorting array, whose immobilized oligonucleotides 14 contain a variable segment 15 and a constant segment 16 which is complementary to the universal priming region 17, including the restored recognition site of the restriction enzyme 16 a, 17 a. Each strand is bound at a location dependent upon its variable sequence 100 adjacent to its priming sequence. At this point the array need not be a sectioned array; it may be a plain array. The array is then washed to remove unhybridized strands. The entire array is then incubated with DNA polymerase. Consequently, a complementary copy 18 of each hybridized DNA strand is generated by extension of the 3′ end of the oligonucleotide to which the strand is bound. The array is then vigorously washed to remove the original DNA strands and all other material not covalently bound to the surface (not shown).

The covalently bound copy strands can be amplified. During the amplification reaction it is usually desirable that the array be sectioned. The wells are filled with a solution containing universal primers 19, 20, an appropriate DNA polymerase, and the substrates and buffer needed to carry out a polymerase chain reaction. The array can, if desired, be sealed with a coversheet, further isolating the wells from each other. A polymerase chain reaction is carried out simultaneously in each well of the array. This procedure results in sorting the mixture of strands into groups of strands that share the same terminal oligonucleotide sequence, each strand (or each group of strands) being present in a different well of the array and amplified there.

The most important factor determining the purity of the sorted strands is the specificity of the hybridization between the nucleic acid strands and the immobilized oligonucleotides, i.e., the ratio of the amount of perfectly matched (legitimate) hybrids to the amount of mismatched (illegitimate) hybrids after the hybridization step is completed. In general, perfect hybrids are more stable than mismatched hybrids, and their relative stability is dependent upon a variety of factors, such as temperature, concentration of denaturing agents, the presence and concentration of divalent metal ions, and ionic strength. By adjusting these conditions, differences in stability between the perfect hybrids and hybrids containing a single mismatch can be increased to as high as two orders of magnitude [Wilson, K. H., Blitchington, R., Hindenach, B. and Greene, R. (1988). Species-specific Oligonucleotide Probes for rRNA of Clostridium difficile and Related Species, J. Clin. Microbiol. 26, 2484-2488; Zhang, Y., Coyne, M. Y., Will, S. G., Levenson, C. H. and Kawasaki, E. S. (1991). Single-base Mutational Analysis of Cancer and Genetic Diseases Using Membrane Bound Modified Oligonucleotides, Nucleic Acids Res. 19, 3929-3933].

Methods to increase hybridization specificity and the specificity of the polymerase chain reaction are known in the art [Wallace, R. B., Shaffer, J., Murphy, R. F., Bonner, J., Hirose, T. and Itakura, K. (1979). Hybridization of Synthetic Oligodeoxyribonucleotides to φX174 DNA: The Effect of Single Base Pair Mismatch, Nucleic Acids Research 6, 3543-3557; Conner, B. J., Reyes, A. A., Morin, C., Itakura, K., Teplitz, R. L. and Wallace, R. B. (1983). Detection of Sickle Cell β^(s)-globin Allele by Hybridization with Synthetic Oligonucleotides, Proc. Natl. Acad. Sci., U.S.A. 80, 278-282; Wallace, R. B., Studencki, A. B. and Murasugi, A. (1985). Application of Synthetic Oligonucleotides to the Diagnosis of Human Genetic Diseases, Biochimie 67, 755-762; Saiki, R. K., Bugawan, T. L., Horn, G. T., Mullis, K. B. and Erlich, H. A. (1986). Analysis of Enzymatically Amplified β-globin and HLA-DQα DNA with Allele-specific Oligonucleotide Probes, Nature 324, 163-166; Miyada, C. G. and Wallace, R. B. (1987). Oligonucleotide Hybridization Techniques, Methods Enzymol. 154, 94-107; Saiki, R. K., Walsh, P. S., Levenson, C. H. and Erlich, H. A. (1989). Genetic Analysis of Amplified DNA with Immobilized Sequence-specific Oligonucleotide Probes, Proc. Natl. Acad. Sci., U.S.A. 86, 6230-6234; Wu, D. Y., Nozari, G., Schold, M., Conner, B. J. and Wallace, R. B. (1989). Direct Analysis of Single Nucleotide Variation in Human DNA and RNA Using in situ Dot Hybridization, DNA 8, 135-142; Wu, D. Y., Ugozzoli, L., Pal, B. K. and Wallace, R. B. (1989). Allele-specific Enzymatic Amplification of Beta-globin Genomic DNA for Diagnosis of Sickle Cell Anemia, Proc. Natl. Acad. Sci., U.S.A. 86, 2757-2760; Drmanac, R., Strezoska, Z., Labat, I., Drmanac, S. and Crkvenjakov, R. (1990). Reliable Hybridization of Oligonucleotides as Short as Six Nucleotides, DNA Cell Biol. 9, 527-534; Nielson, K. and Mathur, E. J. (1990). Perfect Match Enhancer: Limits False Priming Events During Amplification Reaction, Strategies In Molecular Biology (A Stratagene newsletter) 3, 17-22; Nielson, K., Wilbanks, A., Hansen, C. and Mathur, E. J. (1991). Improve Specificity of Long Amplification Products with Perfect Match Polymerase Enhancer, Strategies In Molecular Biology (A Stratagene newsletter) 4, 38; Erlich, H. A., Gelfand, D. and Sninsky, J. J. (1991). Recent Advances in the Polymerase Chain Reaction, Science 252, 1643-1651; Lundberg, K. S. and Mathur, E. J. (1991). Optimization of Perfect Match Polymerase Enhancer for the Polymerase Chain Reaction, Strategies In Molecular Biology (A Stratagene newsletter) 4, 4-5].

Terminal mismatches have the least effect on hybrid stability and are the most difficult to discriminate against [Drmanac, R., Strezoska, Z., Labat, I., Drmanac, S. and Crkvenjakov, R. (1990). Reliable Hybridization of Oligonucleotides as Short as Six Nucleotides, DNA Cell Biol. 9, 527-534]. Embodiments, discussed below, in which hybrids are extended at both ends, through enzymatic ligation to a masking oligonucleotide (an oligonucleotide that is hybridized to, and covers a part of, the constant segment of the immobilized oligonucleotide) at one end and through enzymatic extension at the other end, are highly sensitive to terminal mismatches (see Examples 1.2 and 1.3, below).

Another difficulty in achieving perfect hybrids in each area of an array is the different intrinsic stability of hybrids that contain A:T and G:C basepairs in different proportions. It has been found that high concentrations of tetraalkylammonium salts in a hybridization solution minimize these differences, so that the stability of the hybrids can be made to be dependent on only the length of the hybrids [Wood, W. I., Gitschier, J., Lasky, L. and Lawn, R. M. (1985). Base Composition-Independent Hybridization in Tetramethylammonium Chloride: A Method for Oligonucleotide Screening of Highly Complex Gene Libraries, Proc. Natl. Acad. Sci. U.S.A. 82, 1585-1588; Jacobs, K. A., Rudersdorf, R., Neill, S. D., Dougherty, J. P., Brown, E. L. and Fritsch, E. F. (1988). The Thermal Stability of Oligonucleotide Duplexes is Sequence Independent in Tetraalkylammonium Salt Solutions: Application to Identifying Recombinant DNA Clones, Nucleic Acids Res. 16, 4637-4650]. This approach can be used, for example, in hybridization of strands whose termini have been provided with priming regions prior to sorting and when the immobilized oligonucleotides are all of the same length. However, if hybridization is to be coupled to an enzymatic reaction, such as ligation to a masking oligonucleotide, high salt concentrations can be inhibitory. This method also does not apply when the lengths of the immobilized oligonucleotides vary. Another solution for overcoming the problem of different hybrid stabilities consists of applying a temperature gradient across an array, wherein different oligonucleotides are arranged according to the thermal stability of their corresponding hybrids (see Section I, above). In this case, enzymatic reactions can be carried out by utilizing mixtures of enzymes with different temperature optima, ensuring equal reaction efficiency in all wells.

By carrying out hybridizations on sectioned arrays, the specificity of hybridization can be significantly increased above the level that is currently achievable. Because wells are physically isolated from one another, the hybridized strands can repeatedly be released into solution without mixing of material in different wells, and rebound to the immobilized oligonucleotides, followed by washing the array to remove unhybridized strands. Alternatively, the released strands can be rebound to a fresh replica array to eliminate the background that results from the non-specific binding of strands to the array surface. In each succeeding cycle of hybridization, only those strands that have been bound in the previous cycle are available to hybridize. Therefore, the ratio of the perfect hybrids to mismatched hybrids increases as an exponential function of the number of cycles. The number of cycles required to achieve a desired ratio of perfect hybrids to mismatched hybrids for a particular embodiment is determinable in preliminary experiments. If mixtures of nucleic acids of known sequences are used in these experiments, the cycling is repeated until only the legitimate strands are detected (for example, by gel electrophoresis or oligonucleotide probe hybridization) in each well after strand amplification. The test experiments can also be carried out with mixtures of nucleic acids whose sequences are unknown. In this case, the number of different strands in a mixture should be less than the number of different oligonucleotides in the array, and the cycles repeated until the number of empty wells after strand amplification remains constant. The inevitable loss of legitimate strands during the cycling procedure need not be troublesome, since the number of remaining strands needed to reliably initiate subsequent PCR can be as low as 100 [Myers, T. W. and Gelfand, D. H. (1991). Reverse Transcription and DNA Amplification by a Thermus thermophilus DNA Polymerase, Biochemistry 30, 7661-7666].
In those embodiments where priming regions are introduced into the termini of the strands prior to sorting, reversible hybridization cycling is performed after the strands are first bound to the array. If priming regions are introduced by ligation of the hybridized strands to masking oligonucleotides, then cycling is performed after the ligation step.
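
The exponential enrichment described above can be illustrated with a minimal numerical sketch. It assumes, purely for illustration, that a fixed fraction of perfect hybrids (keep_perfect) and a smaller fixed fraction of mismatched hybrids (keep_mismatch) survive each release-and-rebind cycle; these retention values, the starting ratio, and the function names are hypothetical and would in practice be replaced by figures from the preliminary experiments.

```python
def cycles_to_reach_ratio(initial_ratio, keep_perfect, keep_mismatch, target_ratio):
    """Count cycles until the perfect:mismatched hybrid ratio reaches target_ratio.

    Assumes each cycle retains a constant fraction of each hybrid type,
    so the ratio grows by keep_perfect / keep_mismatch per cycle.
    """
    if not 0 < keep_mismatch < keep_perfect <= 1:
        raise ValueError("need 0 < keep_mismatch < keep_perfect <= 1")
    ratio, cycles = initial_ratio, 0
    while ratio < target_ratio:
        ratio *= keep_perfect / keep_mismatch
        cycles += 1
    return cycles


def legitimate_strands_left(initial_copies, keep_perfect, cycles):
    """Expected number of perfectly matched strands remaining after cycling."""
    return initial_copies * keep_perfect ** cycles


# Hypothetical values: equal starting amounts, 90% vs. 30% retention per cycle.
cycles = cycles_to_reach_ratio(1.0, 0.9, 0.3, 100.0)
remaining = legitimate_strands_left(10_000, 0.9, cycles)
print(cycles, round(remaining))
```

Under these assumed values the loss of legitimate strands stays well above the roughly 100 strands cited above as sufficient to initiate PCR.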

The results of hybridization can be improved further by “proofreading”, or editing, the hybrids formed, by selectively destroying those hybrids that contain mismatches, without affecting perfect hybrids. Various means of hybrid proofreading by chemical and enzymatic methods are discussed in detail herein (see Example 5.1.1, below).

The necessary level of hybridization specificity depends on the complexity of the sorted nucleic acid mixture, and on the particular use to which the sorted strands will be put. Therefore, the above methods for improving specificity need not be used in every case.

The length of the immobilized oligonucleotides in a strand sorting array is chosen to suit the number of strands to be sorted. When sorting strands according to their terminal sequences, the number of different strands obtained in each well equals the number of times that a particular oligonucleotide complementary to the variable segment of the immobilized oligonucleotide occurs among the termini of different strands in the mixture. If the number of nucleotides in each variable segment is n, then the total number of such variable sequences is 4^n, and the mean number of different strands in a well is N/4^n, where N is the number of different strands in the mixture, provided that nucleotide sequence is random, and that each of the four nucleotides is present in equal proportion. If a random sequence that is the size of an entire diploid human genome (6×10⁹ basepairs) is completely digested by a restriction endonuclease that has a hexameric recognition site, then the resulting mixture will contain approximately 3×10⁶ strands with an average length of 4,096 nucleotides. If this mixture is then applied to a comprehensive binary array having variable segments eight nucleotides long, then each well will contain, on average, approximately 45 different strands.
A similar degree of sorting (i.e., approximately the same number of different strands in a well) will be achieved if a random sequence that is the size of an entire diploid Drosophila genome (3×10⁸ basepairs) is digested with a restriction endonuclease that has a hexameric recognition site, and is applied to an array whose variable segments are six nucleotides long, or if it is digested with a restriction endonuclease having a tetrameric recognition site and is applied to an array whose variable segments are eight nucleotides long. Similarly, the same degree of sorting can be achieved if a random sequence that is the size of an Escherichia coli genome (5×10⁶ basepairs) is sorted on an array containing trinucleotide-long variable segments after digestion by a restriction endonuclease that has a hexameric recognition site, or if it is sorted on an array containing pentameric variable segments after digestion by a restriction enzyme that has a tetrameric recognition site. An increase in the length of the variable segments, or the use of a restriction endonuclease that has a longer recognition site, will result in there being fewer different strands per well.
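
The estimates above can be reproduced with a short calculation. The following is only a sketch under the same idealizing assumptions stated in the text (random sequence, equal base proportions, complete digestion, two strands per fragment); the function name and argument names are illustrative.

```python
def mean_strands_per_well(genome_bp, site_len, variable_len):
    """Mean number of different strands per well, N / 4**n.

    genome_bp    -- size of the (random) sequence in basepairs
    site_len     -- length of the restriction enzyme recognition site
    variable_len -- length n of the variable segments in the array
    """
    fragments = genome_bp / 4 ** site_len   # one cut expected every 4**site_len bp
    strands = 2 * fragments                 # each fragment melts into two strands
    wells = 4 ** variable_len               # comprehensive array: all n-mers
    return strands / wells


human = mean_strands_per_well(6e9, 6, 8)
drosophila_hex = mean_strands_per_well(3e8, 6, 6)
drosophila_tet = mean_strands_per_well(3e8, 4, 8)
e_coli_hex = mean_strands_per_well(5e6, 6, 3)
e_coli_tet = mean_strands_per_well(5e6, 4, 5)
print(round(human), round(drosophila_hex), round(e_coli_hex))
```

Because only the sum of the recognition-site length and the variable-segment length enters the result, the paired Drosophila cases give the same mean, as do the paired E. coli cases.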

The actual number of strands in each well can differ significantly from the mean. This is especially true for real nucleic acids that do not have random sequences, and wherein the proportion of the four different nucleotides is usually unequal. For example, the content of A and T nucleotides in the human genome is about 1.5 times higher than that of G and C nucleotides. This will result in some wells containing fewer than the mean number of strands, and some wells containing many more. There may be too many strands in a well for some subsequent uses (e.g., for sequencing).
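
The size of this effect can be gauged with a simple model. The sketch below assumes, as a simplification not made in the text, that bases occur independently of one another, with A and T each 1.5 times as frequent as G and C; it then compares the expected occupancy of a well with the uniform-composition mean. The function names are illustrative.

```python
def base_frequencies(at_to_gc=1.5):
    """Per-base frequencies when A and T are each at_to_gc times as common as G and C."""
    gc = 1.0 / (2.0 * (1.0 + at_to_gc))
    at = at_to_gc * gc
    return {"A": at, "T": at, "G": gc, "C": gc}


def relative_occupancy(segment, at_to_gc=1.5):
    """Expected well occupancy relative to the uniform mean (1.0 = exactly the mean)."""
    freqs = base_frequencies(at_to_gc)
    probability = 1.0
    for base in segment:
        probability *= freqs[base]
    return probability * 4 ** len(segment)


at_rich = relative_occupancy("AAAATTTT")
gc_rich = relative_occupancy("GGGGCCCC")
mixed = relative_occupancy("AACCGGTT")
print(round(at_rich, 2), round(gc_rich, 2), round(mixed, 2))
```

Under this model an octanucleotide made only of A and T is expected in a well several times more often than the mean, and one made only of G and C several times less often.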

In cases where overloaded wells are a problem, our invention provides means to overcome the problem. If the material to be sorted is a mixture of double-stranded fragments, such as DNA fragments produced by restriction endonuclease digestion, the fragments are melted into single strands before hybridization to a sectioned oligonucleotide array. If, for example, the strands are sorted by their 3′ termini on a binary sectioned array, the complementary strands from the same double-stranded fragment will sort into different wells of the array, because their 3′-terminal sequences are almost always different. A subsequent amplification of the sorted strands by symmetric PCR results in both the complementary strands being produced in each of the two wells of the array. If by chance one of the two wells is overloaded, it is highly unlikely that the other well will also be overloaded. Thus, despite the uneven distribution of strands among wells, virtually every strand can be found in a well that is occupied with a moderate number of strands (i.e., a number that does not significantly exceed the mean).
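
A small sketch can make the point about complementary strands concrete. It assumes a fragment bounded at both ends by the same palindromic restored restriction site (GAATTC is used as an example), and it labels each well by the strand's own 3′-terminal variable segment rather than by the complementary immobilized oligonucleotide; the sequences and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def three_prime_well(strand, site, n):
    """Variable segment of length n immediately upstream of the 3'-terminal site."""
    if not strand.endswith(site):
        raise ValueError("strand does not end with the restored restriction site")
    end = len(strand) - len(site)
    return strand[end - n:end]


site = "GAATTC"                      # palindromic example site
body = "AACCGTTTGGCATAC"             # hypothetical internal sequence
top = site + body + site             # one strand of the fragment, 5' to 3'
bottom = reverse_complement(top)     # its complementary strand, 5' to 3'

print(three_prime_well(top, site, 3), three_prime_well(bottom, site, 3))
```

The two strands of the same fragment are assigned to different wells because the variable segment of one strand derives from one end of the fragment and that of its complement derives from the other end.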

Our invention also provides an option for directly monitoring the number of different strands in each well, and for predicting whether the strands that are present in an overloaded well can each be found among wells that are not overloaded. After strands have been sorted and amplified by symmetric PCR, the wells are surveyed for “signature oligonucleotides” with special binary survey arrays discussed below. In this application, a signature oligonucleotide consists of the sequence of the terminal restriction site (such sites having been substantially eliminated from internal regions during the prior restriction endonuclease digestion) and an adjacent variable segment, and thus identifies the terminal sequence of a strand in a well. If strands are sorted by their 3′ termini, each strand in a well will possess the same 3′-terminal signature oligonucleotide, but the strands will almost always possess different 5′-terminal signature oligonucleotides. Similarly, complementary copies of these strands (that are generated during symmetric PCR) will possess identical 5′-terminal signature oligonucleotides, but different 3′-terminal signature oligonucleotides. By determining the number and identity of signature oligonucleotides at either the 5′ end or the 3′ end of the strands in each well, it is possible to directly count the number of different strands in the well, and to determine in which other wells the strands from a particular well are also found (i.e., into which wells their complementary strands have been sorted). If each of these wells is not overloaded, the overloaded well can be ignored for sequencing.

If necessary or desired, the mixture of strands from a highly populated well can be further divided into smaller groups, by sorting according to their 5′ termini (in which case, direct copies will be sorted into groups), or according to their 3′ termini (in which case, complementary copies will be sorted into groups). Even very small arrays can be effective for this purpose. For example, if it is found by surveying, as described above, that after strand sorting by 3′-terminal sequences and amplification by symmetric PCR, a well contains, say, 1,000 different 3′-terminal signature oligonucleotides (which means that there are some 2,000 strands in the well, including both direct and complementary copies), the mixture can then be sorted into 64 groups on a terminal binary sectioned array whose variable segments are as short as three nucleotides. If the second sorting is also carried out according to 3′-terminal sequences, one of the groups will contain slightly more than 1,000 strands (that includes all 1,000 direct copies from the first sorting), and the other groups will contain, on average, 1,000/64≈16 strands (due to the sorting of the complementary copies). This number will double after symmetric PCR amplification of the strands. If, from an examination of the survey results, it is determined that the well with slightly more than 1,000 strands does not contain strands found only in overloaded wells, that well can be ignored for sequencing. If, as is preferred, asymmetric PCR is carried out during the first sorting to only produce the complementary copies, then the mean number of strands will be ≈16 in all 64 groups (i.e., none of the wells will be overloaded).
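
The arithmetic of this second sorting can be written out as a brief sketch. It assumes that the complementary copies distribute evenly over the wells of the second array, which holds only on average; the function name and the returned tuple layout are illustrative.

```python
def second_sort_loads(direct_copies, variable_len):
    """Expected loads after re-sorting an overloaded well by 3'-terminal sequence.

    Returns (number of wells, load of the well that receives all direct copies,
    mean load of every other well), before any further amplification.
    """
    wells = 4 ** variable_len
    complementary_per_well = direct_copies / wells
    shared_well = direct_copies + complementary_per_well
    return wells, shared_well, complementary_per_well


wells, shared_well, other_wells = second_sort_loads(1000, 3)
print(wells, round(shared_well), round(other_wells))
```

With asymmetric PCR in the first sorting, only the complementary copies are present, and every well then carries the smaller mean load.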

The ability to monitor the distribution of strands among wells helps to control the number of strands in a group within certain limits, irrespective of the statistical nature of the sorting. If it is desired to sort 3,000,000 human genome strands into groups of about 45 strands (e.g., for the determination of their sequences with the aid of partialing arrays, see below), one may choose to sort the strands on a large binary sectioned array wherein the most populated well is expected to contain not more than 45 strands. It is not necessary that the variable segments in this array all be of the same length; rather, the length of the variable segments can be chosen to suit the expected frequencies of different oligonucleotide segments in the human genome. For example, taking into account the higher content of A and T nucleotides, the (A+T)-rich variable segments can be made longer than the (G+C)-rich variable segments. If it is desired to use a comprehensive array, then the array can be made comprehensive, as described above. In such an array, most wells will contain fewer than 45 strands, sometimes only a few strands. After each well of the array has been surveyed for terminal signature oligonucleotides to determine the actual distribution of strands among the wells, the strands from chosen wells can be combined to obtain ≈65,000 groups with about 45 strands in each.
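
The text does not specify how wells are chosen for combining. One possible approach, offered here only as an assumption, is a first-fit-decreasing packing of the surveyed strand counts into groups of at most 45 strands. The function name, the data layout, and the example counts are hypothetical.

```python
def pool_wells(strand_counts, capacity=45):
    """Combine wells into groups whose total strand count does not exceed capacity.

    strand_counts -- surveyed number of different strands in each well
    Returns a list of [total, [well indices]] groups (first-fit decreasing).
    A well that alone exceeds capacity is left as its own group.
    """
    groups = []
    order = sorted(enumerate(strand_counts), key=lambda item: item[1], reverse=True)
    for well, count in order:
        if count == 0:
            continue                      # empty wells contribute nothing
        for group in groups:
            if group[0] + count <= capacity:
                group[0] += count
                group[1].append(well)
                break
        else:
            groups.append([count, [well]])
    return groups


groups = pool_wells([45, 30, 15, 20, 25, 0, 5])
print(groups)
```

Other pooling rules would serve equally well; the point is only that surveyed counts allow group sizes to be controlled deliberately rather than left to chance.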

According to our invention, as discussed further below, DNA fragments that are not bounded by restriction sites can also be sorted on sectioned binary arrays by their terminal sequences (see Example 1.4, below).

Our invention also includes methods for isolating individual strands by sorting them according to the identity of their terminal sequences on sectioned binary arrays. The strands can be from restriction fragments or not, so long as unique priming sequences are added to at least one of the strands' termini, such as by methods described herein. If the number of different DNA strands in a sample is rather small, there is a high probability that after the first stage of sorting, many wells in the sectioned array will either not be occupied, or be occupied by only one type of fragment. In the case of a complex mixture of DNA strands (such as the mixture of strands that are obtained from the digestion of an entire human genome), a number of different types of fragments will occupy each well of the sectioned array. In that case, the isolation of individual fragments can be achieved by PCR amplifying the strands in each well in the first stage of sorting and then sorting the group of fragments from each well on a fresh sectioned array. After symmetric PCR amplification, each well of the first array will contain copies of the strands that were originally hybridized there, and also their complementary copies. If the original strands were sorted by their 3′ ends, then their copies in a given well will all possess the same 3′-terminal sequence, and their complementary copies will possess the same 5′ end. However, the 3′-terminal sequences of the complementary copies of the original strands in each well will be different (as will be the 5′-terminal sequences of the original copies). Therefore, the complementary strands will bind at different locations within the new sectioned array, according to the identity of their own 3′-terminal sequences, and with a high probability, each of them will occupy a separate well, where they can then be amplified. Alternatively, the second stage of sorting can be carried out according to the identity of the terminal sequences at the other end of each strand.
For example, if the strands were sorted in the first stage by their 3′ ends (on an array whose immobilized oligonucleotides contain constant segments that are upstream of the variable segments), then the groups of strands from each well in the first array can be sorted in a second stage by their 5′ termini (on an array whose constant segments are downstream of the variable segments). In either procedure, as a result of the second round of sorting, almost all of the different types of fragments are separated from one another (with the exception of virtually identical allelic strands from a diploid genome, which usually have identical termini, and consequently are sorted into the same well). Other aspects of strand isolation are discussed herein (see Example 1.5, below). The isolated strands can then be used for any purpose. For example, they can be inserted into vectors and cloned, or they can be amplified and their sequences determined using methods known in the art.

Our invention also includes the use of binary arrays for isolating selected strands by sorting according to the identity of terminal sequences (see Example 1.6, below). Strands can, for example, be selected that contain particular regions (such as genes) of special interest from a clinical viewpoint. After the relevant portion of a genome has been sequenced, an array can be made using only preselected oligonucleotides whose variable segments uniquely match the terminal sequences of the strands of interest, i.e., they would be long enough to uniquely hybridize to the desired strands. Alternatively, strands of interest can be isolated by sorting on a sectioned array having immobilized thereon previously isolated selected genomic (single-stranded) fragments, rather than synthetic oligonucleotides. In this case, the isolation procedure will have much in common with the sorting of strands according to the identity of their internal sequences, which is discussed next.

Our invention also encompasses methods that include sorting DNA fragments according to their internal sequences (see Examples 2.1 and 2.2, below). When sorting by internal sequences, the specificity of sorting is, as a rule, lower than when sorting by terminal sequences because the strands may be bound at more than one internal oligonucleotide. Thus, strands may bind at more than one well in the array. However, this type of sorting can be useful for a number of applications, such as the isolation of strands that contain particular internal sequence segments (utilizing a sectioned ordinary array), or the sorting of strands according to the identity of variable oligonucleotide segments adjacent to internal restriction sites of a particular type (utilizing a sectioned binary array). The latter approach is useful for ordering sequenced restriction fragments (see Section V, below). The sorting of strands by their internal segments on a 3′ sectioned ordinary array is useful for the generation of partial strands by virtue of extension of the immobilized oligonucleotides (see Section III, below).

Our invention includes the sorting, in particular for sequencing, of natural mixtures of RNA molecules, such as cellular RNAs. The sequences of eukaryotic genes are usually interrupted by many large non-coding inserts, called introns. Following transcription, the introns are excised from the RNA sequence, and the remaining segments, called exons, are linked together in a process called splicing, to produce messenger RNAs (Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A. and Weiner, A. M. (1987). Molecular Biology of the Gene, 4th edition, The Benjamin/Cummings Publishing Co., Menlo Park). Establishing messenger RNA sequences is therefore useful not only for the identification and localization of genes in the genomic DNA, but also for providing information necessary to determine the coding gene sequences (i.e., the exon/intron structure of each gene). Furthermore, the analysis of cellular RNAs in different tissues, at different stages of development, and in the course of a disease, will clarify which genes are active in these instances and which are not. Usually, RNAs are short enough to be sorted and analyzed without preliminary fragmentation. Details of RNA sorting are provided in Example 1.7, below.

III. Preparing Partial Strands of Nucleic Acids and Manipulating Nucleic Acids on Sectioned Arrays

Our invention includes methods of using sectioned arrays for preparing all possible partial copies of a strand or a group of strands. Preparing complete sets of partials of a strand(s), and sorting the partials by their variable ends is especially useful in a process for determining the sequence of the strand or strands, as described herein. The preparation of partials is accomplished by either of the following methods: (1) terminally sorting on sectioned binary arrays a mixture of partial strands generated by degradation of a “parental” strand(s) at random; or (2) generating partials on a sectioned ordinary array, through the sorting of a parental strand(s) according to the identity of the strand's internal sequences, followed by the synthesis of (complementary) partial copies of the parental strand(s) by the enzymatic extension of the immobilized oligonucleotides, utilizing the hybridized parental strands as templates, and then copying the immobilized partials. In either case, the partials that are generated correspond to a parental strand whose 3′ or 5′ end is truncated to a different extent (the “variable” end), and whose other end is preserved (the “fixed” end). These are “one-sided partials”. By using comprehensive arrays, it is possible to prepare every possible one-sided partial of a strand.

In the first case (partialing before sorting), a strand, a double-stranded fragment, a group of strands, or a group of double-stranded fragments, carrying terminal priming regions (these can be a strand or a group of strands sorted on a sectioned binary array as described above), is randomly degraded by a chemical or an enzymatic method, or by a combination of both (see Examples 3.1 and 3.2, below). Care is taken to ensure that partials of different length are produced in roughly equal proportion. Then the mixture of partials is sorted on a sectioned binary array according to the identity of their newly generated termini, essentially as described above for the sorting of full-length strands by their terminal sequences, with new priming sites being introduced at these new termini either before or after sorting. Only those partials that possess both the newly introduced priming site and the already existing priming site (at the opposite end) will be amplified by subsequent PCR. Partials can be sorted according to the identity of a variable sequence at either their 3′ termini or their 5′ termini. However, as is the case for the sorting of full-length strands, the highest specificity can be achieved by sorting according to the identity of a variable sequence at the 3′ termini, and carrying out the sorting on 3′ arrays having constant segments located upstream of the variable segments, or by sorting according to the identity of a variable sequence at the 5′ termini, and carrying out the sorting on 5′ arrays having constant segments located downstream of the variable segments. In these cases, sorting can be followed by the generation of immobilized (complementary) copies of the sorted partials. The arrays with the immobilized copies can serve as permanent banks of the sorted partials which can subsequently be amplified over and over to generate copies for further use.
Following sorting, each well in the array will contain immobilized copies of all of those partials whose variable end is complementary to the variable segment of the immobilized oligonucleotide. The other (fixed) end of these partials will be identical to one of the ends of the parental strands. If an oligonucleotide segment occurs more than once in a strand, or if it occurs in more than one strand in the group of strands subjected to partialing, then the well will contain a corresponding number of different partials, all sharing the same sequence at their variable ends.
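
The outcome of partialing before sorting can be sketched for a single short strand. The sketch assumes idealized degradation that yields every possible 3′-truncated partial exactly once, omits the priming regions, and labels each well by the partial's own 3′-terminal segment (the immobilized variable segment would be its complement). The example sequence and function name are hypothetical.

```python
from collections import defaultdict


def sort_partials_by_new_terminus(parent, n):
    """Group all 3'-truncated partials of parent by their n-nucleotide variable end.

    Every partial keeps the parental 5' (fixed) end; the key is the
    sequence at its newly generated 3' (variable) end.
    """
    wells = defaultdict(list)
    for end in range(n, len(parent) + 1):
        partial = parent[:end]
        wells[partial[-n:]].append(partial)
    return dict(wells)


wells = sort_partials_by_new_terminus("ACGTACGGA", 3)
print(wells["ACG"])
```

Because the trinucleotide ACG occurs twice in this example strand, its well receives two different partials that share the same variable end.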

In the second case (sorting before partialing), partials are prepared directly from the parental strands that are hybridized to a sectioned ordinary array without prior degradation of the nucleic acids. A strand, or a mixture of strands, is hybridized to a 3′ ordinary array. The immobilized oligonucleotides are then used as primers for copying the hybridized strands, beginning at the location within each bound strand where hybridization occurred, and ending at the upstream terminus of each bound strand. After extension of the immobilized oligonucleotides, the hybridized parental strands are discarded. At this point the wells contain immobilized (complementary) partial strands. The partials in one well all share a 5′-terminal oligonucleotide segment that is complementary to a particular internal oligonucleotide in the parental strand(s). The partial strands have 3′-terminal sequences that include the complement of the 5′-terminal region of the parental strand(s) (which contains a priming region). Again, if an oligonucleotide occurs more than once in a strand, or if it occurs in more than one strand in the group of strands subjected to partialing, then the well will contain a corresponding number of different partials. Unlike the methods described above for partialing before sorting, the immobilized complementary partials will contain a priming region at only one end and therefore cannot be amplified exponentially. However, their linear amplification is possible, with the partials being synthesized as DNAs or RNAs. Where RNA partials are generated, the priming region at the partial copy's 3′ terminus contains an RNA polymerase promoter. Synthesis of RNA copies is more efficient than linear synthesis of DNA copies. Alternatively, the synthesized copies can be provided with second priming regions by a variety of methods, and can then be amplified in an exponential manner by PCR.
Examples of methods in which partials are generated on arrays are discussed in Example 3.3, below, and this approach for preparing partials is illustrated, schematically, in FIG. 5.

FIG. 5 illustrates the generation of partials for one DNA parental strand 30 on a 3′ sectioned ordinary array. First, the strand 30 (many copies, of course) such as obtained from well 13 a of sorting array 13, is hybridized to the partialing array 31, a 3′ sectioned ordinary array, containing well 31 a. The parental strand 30 binds to many different locations within the array, dependent on which oligonucleotide segments are present in the strand. A hybrid 32 is formed in each well of the array that contains an immobilized oligonucleotide complementary to a strand's oligonucleotide segment. After hybridization, the entire array is washed and incubated with an appropriate DNA polymerase in order to extend the immobilized oligonucleotides utilizing the hybridized strand as a template. Each extension product strand 33 is a partial (complementary) copy of the parental strand. Each partial begins at the place 32 in the strand where hybridization occurred and ends at the strand's terminus. The strand preferably terminates at its 5′ terminus with a universal priming sequence 17, such as one introduced into all strands when sorting strands on a sectioned binary array as described previously. This allows for later amplification of the partials. That priming sequence can contain a restored restriction site 16 a. The parental strand may also contain, if it was previously sorted on a binary sorting array, a priming sequence at its 3′ terminus 17, adjacent to the variable sequence 100 that the strand was previously sorted by.

The entire array is then vigorously washed under conditions that remove the parental DNA strands and other material, preferably all, that is not covalently bound to the surface. The individual areas of the array then contain immobilized strands 33 that are complementary to a portion of the parental strand. The wells can then be filled with a solution containing the universal primer (or promoter complement), an appropriate polymerase, and the substrates and buffer needed to carry out multiple rounds of copying of the immobilized partial strands. The array can then be sealed, isolating the wells from each other, and (linear) copying can be carried out simultaneously in all of the wells in the array.

The partialing array, containing the covalently bound complementary partial copies 33 of the parental strands, can be stored and used at a later time for the generation of additional copies of the complete set of partials, or, if desired, only for the generation of additional copies of the partials contained in selected wells.
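
The generation of partials by extension of immobilized oligonucleotides can be modeled in a few lines. The sketch assumes perfect hybridization at every occurrence of each oligonucleotide segment and complete extension to the 5′ terminus of the parental strand; the priming sequence is not modeled, and the example sequence and function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def partials_on_ordinary_array(parent, n):
    """Map each immobilized n-mer to the extension products formed in its well.

    The immobilized oligonucleotide is the reverse complement of a parental
    segment parent[i:i+n]; extension copies the parent from that segment to
    its 5' terminus, giving the reverse complement of parent[:i+n].
    """
    wells = defaultdict(set)
    for i in range(len(parent) - n + 1):
        immobilized = reverse_complement(parent[i:i + n])
        wells[immobilized].add(reverse_complement(parent[:i + n]))
    return dict(wells)


parent = "ACGTACGGA"
wells = partials_on_ordinary_array(parent, 3)
print(sorted(wells["CGT"]))
```

In this model every partial begins with the immobilized oligonucleotide of its well and ends with the complement of the parental 5′-terminal region, in line with the description of the complementary partials given above.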

Embodiments for generating partials which employ degradation of nucleic acids and then sorting the resulting degraded (partial) strands by their terminal sequences may have the following advantages as compared with the method of preparing partials directly on an array (by sorting strands by their internal segments): (1) introduction of priming regions at both ends of the partials for subsequent exponential PCR amplification can be accomplished more easily using certain methods, described herein, to introduce priming regions into the degraded strands; (2) secondary structures can interfere with hybridization of nucleic acids to immobilized oligonucleotides, which interference tends to be lessened when hybridization is by terminal sequences; and (3) it is often easier to prepare partials in roughly equimolar amounts, resulting in amplified products that also are roughly equimolar. On the other hand, the method of partialing directly on an internal sorting array has the significant advantage of economy of processing.

Our invention also includes the preparation of partial copies of RNAs on sectioned arrays (see Example 3.4, below).

Methods for sequencing using partialing are described in detail below. Partialing has other uses as well. Our invention also includes the use of sectioned arrays for the isolation of desired individual partials of nucleic acids whose sequences, or partial sequences, are already known. In most cases, these methods allow individual partials to be isolated, irrespective of whether one parental strand, or a group of parental strands, was used as the starting material for the partialing procedure, and irrespective of whether the particular oligonucleotide at the variable end of a partial to be isolated occurs in a strand only once, or more than once. According to this aspect of the invention, partials that originate from different parental strands, and that share the same variable end, are separated from each other by sorting according to their fixed ends if these ends were not yet used for sorting the parental strands. The fixed ends of these partials originating from different parental strands contain variable regions (adjacent to an added priming region at the fixed end) which are almost always different. Where the oligonucleotide at the variable end of a partial to be isolated occurs in a parental strand two or more times, the individual partials that share that oligonucleotide at their variable end are isolated as follows. Instead of using parental strands as the starting material for the generation of partials, the desired partial is generated from another partial, which is chosen so that the desired partial will be the longest partial amongst those that share that variable end. Then, the longest partial is separated from the shorter partials by hybridizing it at an internal oligonucleotide that does not occur in the shorter partials (Example 4.1). (The sequence of the parental strand has previously been determined.)

Our invention also allows the preparation of partials that correspond to a parental strand that is truncated to any extent from both ends. These “two-sided partials” are prepared in a two-stage procedure, each stage resulting in the truncation of one of the ends. The ability to prepare two-sided partials means that the precise excision and isolation of any desired segment of a nucleic acid is possible using the invention, without the need for restriction sites at the boundaries of the segment, and without the need to synthesize specific primers that embrace that segment (Example 4.2).

In making two-sided partials, methods described for making one-sided partials are employed. One-sided partials can be prepared by the method of sorting strands by their internal segments on an array and then extending the immobilized oligonucleotides, or by degrading strands and then sorting them on an array according to their variable ends. The one-sided partials have fixed ends and variable ends. The fixed ends can contain priming regions. If the one-sided partials were prepared by degradation and sorting, then both the fixed and the variable end can also be provided with a priming region during sorting, as described herein. To prepare two-sided partials, the strands from one well of the first array are partialed to truncate their former fixed ends. This can be accomplished by using any of the means described for preparing one-sided partials. For example, complementary partials, preferably having primers at both ends, can be hybridized to wells of an array and the oligonucleotides immobilized in the array can then be extended to produce partial copies that have their former fixed ends truncated. Either direct copies of the partials in the first array, or their complements, may be partialed in the second round of partialing. The choice of whether to use 3′ or 5′ arrays will be apparent to one skilled in the art. The resulting partials will have both termini truncated.

Priming regions can be added to ends of the partials, using the methods described herein. If it is desired to obtain a two-sided partial with no added priming sequences, appropriate cleavable primers, described herein, can be used for amplification.

The same array can be used for both rounds of partialing, and only selected wells in the array need be used.

Our invention also includes the use of sectioned arrays for the manipulation in a great variety of ways of a nucleic acid whose sequence is known (or partially known), including methods for their recombination and site-directed mutagenesis. These methods are based on the ability to prepare any desired partial of a nucleic acid strand according to the invention, and utilize “cleavable primers” as discussed below. Cleavable primers allow the substitution of new terminal priming regions for old priming regions, and allow the removal of a priming region from a partial's, or strand's, end, after amplification has been carried out, when the presence of that priming region would interfere with subsequent manipulations. The cleavage of such a primer does not result in the degradation of a partial (or a strand), because the entire cleavable primer, or just the junction nucleotide that joins the primer to the remainder of the partial, is made chemically different from the rest of the partial (Example 4.3).

Our invention includes using sectioned arrays for carrying out precisely directed recombinations between chosen segments of previously sequenced nucleic acids. This recombination can be carried out on the arrays in a massively parallel fashion, resulting in production of many different recombinants, e.g., for screening, at the same time. The recombinants can be constructed from isolated strands or their partials, or from mixtures of strands or their partials. This method involves the ligation of nucleic acids to each other on the surface of arrays. The immobilized oligonucleotides either serve as sequence-specific “splints” that hold together the correct termini of nucleic acids, thereby ensuring their specific ligation, or they serve as protruding “sticky ends” that are added to the terminus of a double stranded fragment to be ligated, and that direct its ligation to the other desired fragment. In either case, each non-ligated end of the joined fragments has a priming region, so that the recombinant strands (and only the recombinant strands) possess the two terminal priming regions that are required for subsequent exponential amplification by PCR (Example 4.4).

Our invention also includes using sectioned arrays for introducing site-directed mutations into sequenced nucleic acids, including the introduction of nucleotide substitutions, deletions and insertions. This can be carried out in a massively parallel fashion. In one embodiment, a partial whose variable end has been deprived of a priming region is ligated to the free terminus of an immobilized oligonucleotide that contains the mutation to be introduced. In another procedure, where the purpose of mutagenesis is to introduce a single-nucleotide substitution, the substituting nucleotide can be added directly to the variable end of the partial. In both cases, the modified partials or their complementary copies are used to synthesize a mutant strand utilizing as a template either the complementary parental strand (i.e., from which the partials were generated) or a longer complementary partial, or any other strand or partial that encodes the missing region. The fixed end of the mutant partial is provided with a priming region that is different from the corresponding priming region of the template strand. Therefore, only mutant strands are capable of subsequent amplification by PCR. A single array can be used either to mutate many single positions in a gene, or to introduce mutations in many genes in one procedure. Sectioned arrays can also be used for the massively parallel testing of the biological effects of the introduced mutations. For example, parallel coupled transcription-translation reactions can be carried out in the wells of a sectioned array following amplification of the mutant strands. It is thus possible to determine simultaneously, on the same sectioned array, the effects of many different amino acid substitutions on the structure and function of a protein. This is useful for protein engineering (Example 4.5).

IV. Surveying Oligonucleotides with Binary Arrays

Our invention includes the use of binary arrays for surveying the oligonucleotides contained in nucleic acid strands and their partials to determine their oligonucleotide content (see Examples 5.1 and 5.2, below).

Surveying allows information to be obtained about which oligonucleotides are contained in a strand, in a partial, in a group of strands, or in a group of partials. Survey arrays can be comprehensive. Essentially comprehensive surveying is useful in sequencing nucleic acids. The information obtained can be used as a check on a sequence derived by some other means, and thus can be used even if only a partial sequence is obtainable from the survey. According to an important aspect of the invention, discussed elsewhere herein, however, surveying, preferably on a binary array, can be used in combination with other methods described herein to obtain complete sequences of longer nucleic acids than have been sequenced using conventional surveying techniques. Surveys can also be used for diagnostic purposes.

Surveying can also be selective, where only certain oligonucleotides of interest are identified. In selective surveying, the array contains only selected oligonucleotides, which can be rather long without increasing the size of the array. Selective surveying is useful for studying genetic variations, such as mutations and chromosomal rearrangements, when a reference sequence is known. It is also useful for ordering sequenced fragments in a longer nucleic acid, by identifying their “signature oligonucleotides” (discussed below). This method makes it unnecessary to repeat the complete sequencing of overlapping fragment libraries to obtain the sequence of a long nucleic acid.

The use of binary arrays also allows surveying to be improved as compared with the use of ordinary arrays, and it allows new types of selective surveying (such as surveying signature oligonucleotides) to be carried out.

A principal advantage of using binary arrays to survey oligonucleotides is to improve markedly the discrimination against terminal mismatches. Terminal mismatches are responsible for most errors that occur in oligonucleotide surveys that are carried out by hybridization [Drmanac et al., DNA Cell Biol. 9, 527-534 (1990), supra]. According to this aspect of the invention, terminal basepairs are checked for a mismatch in two enzymatic reactions, ligation and primer extension, that are both highly sensitive. A further advantage of using binary arrays is that a hybrid can be labeled at each end after it has formed, and in a manner that is dependent upon the success of these two enzymatic reactions, thus enabling background levels to be significantly reduced. Also, binary arrays can increase hybrid length (by ligation and extension), which allows the detection of hybrids to occur under optimal conditions.

In surveying, nucleic acid strands first can be randomly degraded into pieces whose average length slightly exceeds the surveyed length. Degradation of DNA strands prior to hybridization has been proposed to overcome interference from internal secondary structures that are present in a single-stranded DNA molecule [Lysov, Yu. P., Florentiev, V. L., Khorlin, A. A., Khrapko, K. R., Shik, V. V. and Mirzabekov, A. D. (1988). Determination of the Nucleotide Sequence of DNA Using Hybridization to Oligonucleotides. A New Method, Doklady Akademii Nauk SSSR 303, 1508-1511]. There are, however, other advantages of degradation prior to hybridization. For example, degradation significantly increases the molar yield of hybridization that can be achieved with the same amount of material, especially in the case of long nucleic acid strands (or partials). Moreover, degradation equalizes the molar yield of individual hybrids that can be obtained from strands of different length. Without degradation, once a DNA or RNA molecule is bound by one of its oligonucleotide segments, the rest of that molecule is not available for hybridization. Therefore, the molar amount of hybrids that are produced by a strand is inversely proportional to its length, since longer strands are distributed among a larger number of areas in an array. Degradation breaks each strand into many pieces of the same average length, and each of these pieces can hybridize to the survey array independently of the others. For example, degradation of a 4,000-nucleotide-long strand into 20-nucleotide-long pieces can result in up to a 200-fold increase in the molar yield of hybridization at each relevant area in an array. Moreover, there is the same molar amount of hybrids at each relevant area in an array as would be produced by a similarly fragmented strand that is only 200 nucleotides in length. Finally, random strand degradation allows each nucleotide in the strand to become a terminal nucleotide. This observation is taken advantage of to increase specificity of hybridization in preferred methods of surveying oligonucleotides described below.
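The yield argument above can be checked with a little arithmetic. The following is a minimal sketch, not part of the described method: it assumes, as a simplification, that a strand has about as many relevant array areas as it has nucleotides, that an undegraded molecule occupies a single area, and that every piece of a degraded molecule hybridizes independently. The function name and parameters are hypothetical.

```python
from fractions import Fraction

def per_area_yield(moles, strand_length, piece_length=None):
    """Approximate molar amount of hybrid per relevant array area.

    Simplifying assumption: the number of relevant areas is about equal
    to the strand length. An undegraded molecule binds one area; a
    degraded molecule contributes strand_length / piece_length pieces.
    """
    pieces = 1 if piece_length is None else Fraction(strand_length, piece_length)
    return Fraction(moles) * pieces / strand_length

intact = per_area_yield(1, 4000)            # undegraded 4,000-nucleotide strand
degraded = per_area_yield(1, 4000, 20)      # degraded into 20-nucleotide pieces
short_degraded = per_area_yield(1, 200, 20) # 200-nucleotide strand, same pieces

print(degraded / intact)           # 200
print(degraded == short_degraded)  # True
```

Under these assumptions the sketch reproduces both statements in the text: a 200-fold gain for the 4,000-nucleotide strand, and an equal per-area yield for fragmented strands of different lengths.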

After degradation, each resulting nucleic acid piece is ligated to the same type of oligonucleotide (i.e., a constant sequence), that preferably does not occur anywhere in the internal regions of the analyzed nucleic acids. For example, the sequence of the added oligonucleotide can contain the recognition site of a restriction endonuclease that was used to digest the DNA prior to fragment sorting. The ligation can be carried out in solution prior to hybridization, or after hybridization of the pieces to binary immobilized oligonucleotides whose constant segment is complementary to the oligonucleotide to be ligated. Preferably, a 3′ array is used, having constant segments upstream from variable segments. The immobilized oligonucleotides can then be extended with an appropriate DNA polymerase, using the hybridized nucleic acid pieces as templates. It is preferable that after extension all hybrids have the same length. This can be achieved by employing dideoxynucleotides as substrates for the DNA polymerase, which causes the immobilized oligonucleotides to be extended by only one nucleotide. These methods can be used to survey both DNA and RNA (see Examples 5.1.1 and 5.2).

Hybrids can be labeled in both a ligation-dependent and an extension-dependent manner to increase the specificity of hybrid detection, as described in Example 5.1.2, below. Also, the ligated oligonucleotides and the added dideoxynucleotides can be tagged with different labels, for example, fluorescent dyes of different colors. The array is then subsequently scanned at two different wavelengths, and only those areas in the array that emit fluorescence of both colors indicate perfect hybrids (see Example 5.1.2).
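The two-color rule amounts to taking the intersection of the areas that score positive in each channel. The sketch below illustrates that logic only; the area names, signal values, and the single shared threshold are hypothetical and are not taken from the examples.

```python
def perfect_hybrid_areas(ligation_signal, extension_signal, threshold):
    """Return the array areas whose signal meets the threshold in BOTH channels.

    ligation_signal / extension_signal: dicts mapping an area name to the
    fluorescence measured at the wavelength of the corresponding label.
    """
    return {
        area
        for area, value in ligation_signal.items()
        if value >= threshold and extension_signal.get(area, 0.0) >= threshold
    }

# Hypothetical readings for three areas of an array.
ligation = {"A1": 0.92, "A2": 0.88, "A3": 0.05}
extension = {"A1": 0.95, "A2": 0.03, "A3": 0.90}

perfect = perfect_hybrid_areas(ligation, extension, threshold=0.5)
print(sorted(perfect))  # ['A1']
```

Areas A2 and A3 each light up in only one color and are therefore rejected, which is how a hybrid with a mismatch at one terminus would be excluded.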

Survey results can be improved further by hybrid proofreading, that is, by destroying hybrids containing mismatches using chemical or enzymatic methods (see Examples 5.1.1 and 5.2, below).

Selected oligonucleotides (see Example 5.1.3, below) and signature oligonucleotides (see Example 5.1.4, below) can also be surveyed on binary arrays, as is described below.

V. Use of the Oligonucleotide Arrays for the Sequencing of Nucleic Acids

The arrays and methods of this invention can be used to determine the nucleotide sequence of nucleic acids, including the sequence of an entire genome, whether it is haploid or diploid. This embodiment requires neither cloning of fragments nor preliminary mapping of chromosomes. It is especially significant that our method avoids cloning, a labor-intensive and time-consuming approach that is essentially a random search for fragments. In a preferred embodiment of our invention, a comprehensive collection of whole nucleic acids or nucleic acid fragments is sorted into discrete groups. The sorted nucleic acids are then amplified with a polymerase, preferably by a polymerase chain reaction.

This method has advantages over cloning. Cloning is a form of amplification that begins with a single DNA molecule. The cloned DNA can contain somatic mutations (including those caused by environmental factors) which were not present in the zygotic DNA, and which accumulate during an individual's lifetime. Also, sequence alterations can occur when the DNA is cloned in the host cell. Moreover, cloning involves selective steps that can reject some sequences in favor of others. In contrast, the use of a polymerase, especially in a polymerase chain reaction, to amplify sorted fragments begins with a large number of DNA strands, and the sequence obtained from the amplified material is an averaged representative of the DNA in the analyzed sample, for example the DNA from many somatic cells, thus reflecting the sequence of zygotic DNA.

Sequencing large diploid genomes, such as a human genome, using the arrays and methods of this invention is shown in FIG. 6. We will describe the overall method in general terms. The overall method employs several more specific methods already described. For details, reference should be made to the descriptions set forth above and in the examples. In the embodiment illustrated in FIG. 6 an individual's genomic DNA 40 is digested with a restriction endonuclease and sorted by terminal sequences into groups of strands using a 3′ sectioned binary sorting array 13, as is described above in Section II and illustrated in FIG. 4.

Next, treating each well 13 a of the sorting array separately, a complete set of partials is prepared for each group of sorted strands using a sectioned array 31, as is described above in Section III and illustrated in FIG. 5. The partials can be generated in any chosen manner to make them detectable.

Then the contents of each well 31 a of the partialing array 31 are surveyed using a survey array 42, as is described above in Section IV. Preferably the survey array is a binary array, but an ordinary array may also be used. In the embodiment shown in FIG. 6, surveying is performed with a sheet 43 containing miniature survey arrays 42 that have been printed in a pattern that coincides with the number and location of the wells in the partialing array. Miniature survey arrays are discussed further below. Larger arrays can be used as well for surveying. The oligonucleotide information that is obtained can be used, according to our invention, to separately determine the nucleotide sequence of every strand in each of the groups isolated on the sorting array. The invention can also be used to determine incomplete sequences, such as when ambiguous results are obtained because of, for example, the presence of monotonous sequences or multiple repeats within the strands. The possibilities for ambiguous results, however, are minimized using methods described herein.

To determine the order of the fragments sequenced as illustrated in the embodiment of FIG. 6, genomic DNA 40 is digested with at least a second restriction endonuclease and sorted into groups of strands using a 3′ sectioned binary sorting array 44, as is described above in Section II and illustrated in FIG. 4. The contents of each well 44 a of the sorting array 44 are surveyed with special survey arrays 45, 46 that identify signature oligonucleotides (described below) in intersite segments of sorted fragments from different digests. This is done to determine the order of the fragments relative to one another without regard to differences between allelic pairs of fragments. In the embodiment shown in FIG. 6 this surveying is performed with printed sheets 47, 48 that have been printed with a pattern of miniature arrays 45, 46. Larger arrays can, of course, be used.

To allocate the ordered allelic fragments to their respective chromosomes in a diploid organism, fragments are linked by their allelic differences. In the embodiment illustrated in FIG. 6, the strands from selected wells of the sorting array 44 are transferred to a selected well of one of a series of partialing arrays 49, partials are generated, and the partials are surveyed using miniature survey arrays 50 on printed sheets 51. Only the presence of oligonucleotides containing allelic differences in the selected partials needs to be determined to link a pair of allelic fragments to their respective neighboring allelic fragments.

In some cases, abbreviated methods can be used for sequencing. For example, the final stage can be omitted when a haploid genome is sequenced, because in this case the ordering of the fragments will immediately result in their unambiguous linkage. If a mixture of undegraded cellular RNAs is to be sequenced, even the ordering step can be omitted.

As described above, this invention provides for comprehensive sequence analysis without resort to other methods (except for the resolution of a small number of ambiguities). Of course, portions of the entire procedure can be used independently, and in conjunction with other methods, if desired. For example, partialing and survey arrays and methods can be used to sequence cloned strands without sorting. Similarly, the fragment ordering procedure can be used to order fragments that have been sequenced by any method. Finally, allelic fragments can be allocated to their chromosomes by the method of this invention, no matter how fragment order has been established.

A detailed description of the sequencing procedure will now be provided. As will be apparent, some of the methods described can be carried out using conventional oligonucleotide arrays, as opposed to the novel arrays of the invention.

If the nucleic acid to be sequenced is a large DNA molecule, or a mixture of large DNA molecules (such as the genome of a prokaryotic or eukaryotic organism), it is first digested by a site-specific method that results in the cleavage of each type of DNA in the sample at specific locations within its sequence. One preferred method is to cleave with a restriction endonuclease and to sort by terminal sequences using a 3′ sectioned binary sorting array as described in Section II above. Advantageously, the length of fragments should not exceed about ten thousand nucleotides, so that the fragments can be efficiently amplified by PCR. The array used for strand sorting should be comprehensive (see Section I, above) so that no strand is lost. The length of the variable segments chosen (and therefore, the overall number of different types of oligonucleotides in the array) will depend on the complexity of the sorted fragment mixture, and preferably should be chosen so that there will be no more than 100 or so different strands sorted into a well. The choice should be made according to considerations discussed in Section II, above.

For linear DNA (as opposed to a circular DNA) almost every strand is provided with two terminal priming regions, each of which includes the recognition site for the restriction endonuclease or other site-specific agent used for digesting the DNA. Almost every strand will therefore be exponentially amplifiable by PCR. The exceptions are strands that arise from the fragments at the ends of each DNA, i.e., terminal (telomeric) fragments: these possess a priming region at only one end, and cannot be exponentially amplified by PCR. Telomeric fragments can be isolated in a separate procedure that utilizes affinity to characteristic telomeric sequences. For example, in human chromosomes the telomeres consist of many characteristic tandem repeats of TTAGGG, which will bind to their complement on an array [Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. and Watson, J. D. (1989). Molecular Biology of the Cell, 2nd edition, Garland Publishing, New York]. Alternatively, telomeric fragments can be isolated by specifically binding them to a telomere protein [see, for example, Raghuraman, M. K. and Cech, T. R. (1989). Assembly and Self-association of Oxytricha Telomeric Nucleoprotein Complexes, Cell 59, 719-728].

When sorting according to the identity of terminal sequences, each strand occupies a particular “address” in the array. It is convenient to think of the address as the oligonucleotide sequence within a strand that directs the DNA strand to hybridize to a particular location within the array, i.e., the sequence that is perfectly complementary to the variable sequence of the oligonucleotide immobilized at that location. The “address” also identifies the location within the array where the DNA binds.

After sorting, each group of strands is amplified (described in the examples and Section II above) and subjected to partialing (see Examples 3.1 to 3.3, below). Importantly, the isolation of individual strands is not necessary, because our method allows the nucleotide sequence of each strand in a mixture to be determined. In particular, our method allows the sequences of strands in a well of the sorting array to be determined, separately from mixtures of strands in other wells. In a preferred embodiment, the partialing array is comprehensive (see Section I), in order to obtain all possible one-sided partials. At the same time, smaller partialing arrays having some oligonucleotides excluded can also be used for partialing to obtain sequence information, as discussed below in this section. Each group of partials is amplified prior to surveying. Most preferably, the amplification is carried out in such a manner that one of the two complementary partial strands is produced in great excess over the other.

Each group of partials is surveyed by hybridization to a survey array, in order to identify their constituent oligonucleotides. Surveying is preferably carried out using binary arrays (see Example 5.1, below) but can be performed with ordinary arrays. The arrays are preferably comprehensive, in order to obtain a complete list of the oligonucleotide segments that are contained in the partials.

The selection of the optimal lengths to use for variable segments in both the partialing arrays and the survey arrays depends on the complexity of the groups of strands to be analyzed and on the length of those strands, and should be based on both theoretical calculations, such as those discussed at the end of this section, and preliminary experiments (with model mixtures of fragments whose sequence is known) designed to evaluate the resolving capacity for each array size. Our calculations show that if the basic length (minimal length) of the variable segments in both the partialing arrays and the sorting arrays is eight nucleotides, the arrays should be adequate for sequencing groups of about 50 strands whose average length is 4,000 nucleotides. If octameric variable segments are used as a basic length, then a comprehensive partialing array will contain at least 65,536 wells. For sequencing smaller groups of similar fragments, or similar groups of shorter fragments, shorter variable segments, and consequently, smaller partialing arrays, can be used. The basic length of the variable segments in the oligonucleotides immobilized on the survey arrays must suit the combined length of all partials in each well of the partialing arrays, so that there are always unoccupied areas in a survey array. If a group of about 50 4,000-nucleotide-long strands is subjected to partialing on a single partialing array, then the basic length of eight nucleotides should be adequate for the variable segments in the oligonucleotides immobilized on the survey arrays. In this case, a comprehensive survey array will contain at least 65,536 different areas. The number of different areas in the survey arrays can be made approximately 50% greater due to the inclusion of special longer oligonucleotides, in order to read through regions of recursive sequences in the strands.
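The array sizes quoted above follow from counting all sequences of a given length over four nucleotides. A minimal sketch of that count is given here; the function name is hypothetical, and the "50% greater" figure is simply applied as stated in the text, not derived.

```python
def comprehensive_array_size(variable_length: int) -> int:
    """Number of wells or areas needed to hold every possible variable
    segment of the given length (four choices per nucleotide position)."""
    return 4 ** variable_length

octamer_areas = comprehensive_array_size(8)
# Survey arrays enlarged by roughly 50% for the special longer oligonucleotides.
extended_areas = octamer_areas + octamer_areas // 2

print(octamer_areas)   # 65536
print(extended_areas)  # 98304
```

Each added nucleotide in the variable segment multiplies the array size by four, which is why shorter variable segments give markedly smaller arrays.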

Although not necessary, it is preferable to have the survey arrays be as compact as possible. It is anticipated that surveying will be advantageously accomplished simultaneously for many or all wells of a partialing array by utilizing a sheet on which miniature survey arrays have been “printed” in a pattern that coincides with the arrangement of wells in the partialing array, in a manner similar to that shown in FIGS. 6 and 7. Referring to FIG. 7, partialing array 31, comprising an array of wells 31 a, is surveyed using sheet 43, having printed thereon an array of miniaturized survey arrays 42. The pattern of arrays 42 corresponds to the pattern of wells 31 a, whereby all wells 31 a can be surveyed simultaneously.

Automated photolithography techniques for preparing miniature oligonucleotide arrays have been developed [Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. and Solas, D. (1991). Light-Directed, Spatially Addressable Parallel Chemical Synthesis, Science 251, 767-773]. The manufacture of miniature arrays on a “chip”, for use in surveys, also has been reported [Fisher, L. M. (Mar. 3, 1991). Microchips for Drug Compounds, The New York Times, p. F7]. It is not, however, necessary for practicing the invention that printed arrays be used for surveying. The contents of wells to be surveyed can be transferred to large arrays instead, the partials having first been amplified sufficiently to make them abundant enough to be detectable.

Surveying with comprehensive arrays produces a complete list of oligonucleotides contained in the partials in each well of the partialing array. As discussed below, the partials in each well share the same terminal variable oligonucleotide. It is important to note that if an oligonucleotide occurs more than once in the same parental DNA strand (or in more than one of the different parental DNA strands) in the same well, there will be more than one different partial strand in that well of the partialing array. The survey will reveal all oligonucleotides that are present in all partials in that well. The method of this invention can determine the sequences of each of the original (parental) fragment strands.

Considering one parental strand, the partial strands are generated in such a manner that they all begin with the same parental terminal sequence, but terminate at a different nucleotide in the parental sequence. A different partial strand is generated for every nucleotide position in the parental sequence. The collection of partials will therefore consist of a nested set in which each successive partial strand is at least one nucleotide longer (if a comprehensive partialing array is used). An illustration of a nested set of partials is shown in FIGS. 8 and 9.

The “partials” referred to in this section are one-sided partial strands that begin at the 5′ terminus of a parental nucleic acid strand (the fixed end) and end at different nucleotide positions in the strand (the variable end). Partials are sorted in the partialing array according to the identity of their variable ends, and therefore each partial has a particular “address” within the array. As with sorting arrays, an “address” in a partialing array is the oligonucleotide sequence that is present at the variable end of the partial strand and that is complementary to the variable segment of an immobilized oligonucleotide. The shortest partials used are as long as the oligonucleotide sequence at the variable end, i.e., the address plus priming region(s) at the partial's end(s). The “address” also relates to the location within the array where the partial strand is found, since the variable segment of the oligonucleotide immobilized in that well is complementary to the oligonucleotide at the partial's variable terminus. The “address” also relates to the location within the parental strand of a partial's terminal oligonucleotide. The location of this “address oligonucleotide” within a parental strand is characterized by an “upstream subset” of oligonucleotides that come before it in the parental sequence and by a “downstream subset” of oligonucleotides that come after it.

Our method of establishing nucleic acid sequences, for either a single strand or a group of parental strands sorted by their terminal sequences, begins by assembling an “address set” for each address in the partialing array. The “address set” is a comprehensive list of all of the oligonucleotides in all the parental strands which have the address oligonucleotide within their nucleotide sequences. The “upstream subset” contains all the oligonucleotides that occur upstream (i.e., towards the 5′ end) of the address oligonucleotide in any parental strands that contain the address oligonucleotide. The “downstream subset” contains all the oligonucleotides that occur downstream (i.e., towards the 3′ end) of the address oligonucleotide in any parental strands that contain the address oligonucleotide. Taken together, the upstream subset and the downstream subset form the “address set.”

The upstream subset of each address can be determined directly from the survey of each well of a partialing array and consists of a list of all the oligonucleotides identified as being present in the partial strands in that well. The downstream subset of each address can be inferred by examining the upstream subsets of all the addresses in the partialing array: the downstream subset of a particular address consists of those addresses whose own upstream subset includes that particular address oligonucleotide. FIG. 8 illustrates how we infer the downstream subset of a particular address from the upstream subsets of the other addresses. Note that the address oligonucleotide is included in both its upstream and downstream subsets, and divides the address set into the two subsets.

The terms “partials” and “addresses” can perhaps be more easily understood by reference to FIG. 9, wherein a complete set of partials is shown for the strand 5′-ATGAGCCTAGATCGGT-3′, which is sixteen nucleotides long. In this illustration, only one strand is being sequenced. The method of this invention is not so limited, however. It has the power to sequence simultaneously a mixture of strands. In FIG. 9, the oligonucleotides at the variable ends of the partials (i.e., their addresses) are three-nucleotide sequences, as are the oligonucleotides surveyed. Accordingly, both the partialing array and the survey arrays used to obtain these results would have 4³, or 64 areas, each coated with a different oligonucleotide sequence whose variable segment is three nucleotides long. The use of such a small array is presented here for ease of illustration, as larger arrays are generally to be used. Terminal priming regions are not shown for the same reason. (It should be noted that the length of the variable segments in the partialing arrays and in the survey arrays need not be the same, i.e., the length of the address oligonucleotides and the length of the surveyed oligonucleotides can be different.) The strand shown in FIG. 9 has fourteen addresses. Starting from the 5′ end, the fourteen addresses are ATG, TGA, GAG . . . GGT. The shortest partial, ATG, is three nucleotides long, and has the address ATG (i.e., the partial was sorted on the partialing array by its variable terminal sequence: ATG). The next shortest partial, ATGA, is four nucleotides long, and has the address TGA. For the other twelve partials, the last three nucleotides in each is its address. The addresses, as they appear in the partials, are underlined in FIG. 9, depicting visually how the addresses propagate down the strand from the 5′ end to the 3′ end. The largest partial is the entire strand of sixteen nucleotides. The complete set of partials is shown nested in FIG. 9 with the longest partial shown on the top of the diagram, and the shortest partial shown on the bottom.
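The nested set of partials and their addresses for the sixteen-nucleotide strand can be enumerated mechanically. The following sketch is illustrative only; the function names are hypothetical, priming regions are ignored as in the figure, and each partial is modeled simply as a 5′ prefix of the strand.

```python
STRAND = "ATGAGCCTAGATCGGT"   # the strand used in the illustration, 5' to 3'
ADDRESS_LENGTH = 3

def one_sided_partials(strand, n):
    """All partials sharing the 5' (fixed) end, shortest first.
    The shortest partial is exactly one address long."""
    return [strand[:end] for end in range(n, len(strand) + 1)]

def address_of(partial, n):
    """The address is the oligonucleotide at the partial's variable (3') end."""
    return partial[-n:]

partials = one_sided_partials(STRAND, ADDRESS_LENGTH)
addresses = [address_of(p, ADDRESS_LENGTH) for p in partials]

# Print the nested set with the longest partial on top.
for partial in reversed(partials):
    print(f"{partial:<16}  address {address_of(partial, ADDRESS_LENGTH)}")
```

Running the sketch lists fourteen partials, from the full strand (address GGT) down to ATG (address ATG).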

If an “address” were defined to be four nucleotides long, the first address in the strand of FIG. 9 would be ATGA, which would be the first of thirteen partials. If an “address” were five nucleotides long, the first address in the strand of FIG. 9 would be ATGAG, which would be the first of twelve partials.

Where the address contains eight nucleotides, a strand having a length of 4,096 nucleotides would contain up to 4,089 different oligonucleotides which are eight nucleotides long, and therefore up to 4,089 different addresses; accordingly, up to 4,089 different partials would be a complete set generated for such a strand.
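These counts all follow from the fact that a strand of length L contains L − n + 1 positions at which an n-nucleotide address can end. A short sketch, with hypothetical function names, also shows why the text says "up to": repeated oligonucleotides reduce the number of distinct addresses.

```python
def max_addresses(strand_length, n):
    """Upper bound on the number of addresses (and partials) for one strand."""
    return max(strand_length - n + 1, 0)

def distinct_addresses(strand, n):
    """Addresses actually present; repeats collapse to a single address."""
    return {strand[i:i + n] for i in range(len(strand) - n + 1)}

print(max_addresses(16, 3))    # 14
print(max_addresses(16, 4))    # 13
print(max_addresses(16, 5))    # 12
print(max_addresses(4096, 8))  # 4089

# A monotonous strand reaches far fewer distinct addresses than the bound.
print(max_addresses(5, 3), len(distinct_addresses("AAAAA", 3)))  # 3 1
```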

As shown in FIG. 9, according to the method of this invention the address set for an arbitrarily chosen address “TAG” contained in the parental strand is determined from the oligonucleotide information obtained from the partials. For the address “TAG”, the upstream subset, i.e., those oligonucleotides that occur 5′ of TAG in the parental strand (plus TAG itself), contains (in alphabetical order) AGC, ATG, CCT, CTA, GAG, GCC, TAG, and TGA. The downstream subset of this address contains AGA, ATC, CGG, GAT, GGT, TAG, and TCG.

To obtain the upstream subset for the “TAG” address set we survey the oligonucleotide content of the well in the partialing array to which the partial that contains the TAG oligonucleotide at its variable terminus hybridized. That well contains the immobilized complementary oligonucleotide “CTA”. (The partialing array, and other arrays used in this invention, are preferably arranged so that the identity of the immobilized oligonucleotides in each well or area is known from its position within the array.) A survey of the oligonucleotides in this well provides the upstream subset of the TAG address.

The downstream subset for the TAG address, i.e., those oligonucleotides that occur on the 3′ side of TAG in the parental strand, is inferred by determining which other addresses contain the TAG oligonucleotide in their upstream subsets. For example, a survey of the well containing an immobilized CGA reveals that the partial with address TCG in FIG. 9 contains TAG among its constituent oligonucleotides. Therefore, TAG is contained in the upstream subset of the address TCG, and, consequently, the TCG oligonucleotide must be contained in the downstream subset of the TAG address. From the survey results of all the other addresses in the partialing array, we similarly determine all other oligonucleotides in the downstream subset of the TAG address.
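The construction of the TAG address set can be reproduced computationally. The sketch below is a simulation under stated assumptions, not the laboratory procedure: the survey of a well is modeled as listing every three-nucleotide segment of the partial in that well, a single parental strand is assumed, and all function names are hypothetical.

```python
STRAND = "ATGAGCCTAGATCGGT"
N = 3

def oligos(seq, n):
    """All n-nucleotide segments of seq (what an ideal survey would report)."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def reverse_complement(seq):
    """Sequence of the immobilized oligonucleotide that binds seq."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def upstream_subsets(strand, n):
    """Map each address to the oligonucleotides surveyed in its well."""
    subsets = {}
    for end in range(n, len(strand) + 1):
        partial = strand[:end]
        subsets.setdefault(partial[-n:], set()).update(oligos(partial, n))
    return subsets

def downstream_subset(address, upstream):
    """Addresses whose own upstream subset contains the given address."""
    return {other for other, ups in upstream.items() if address in ups}

upstream = upstream_subsets(STRAND, N)
print(reverse_complement("TAG"))                    # CTA
print(sorted(upstream["TAG"]))
print(sorted(downstream_subset("TAG", upstream)))
```

The two printed lists match the upstream and downstream subsets given in the text for TAG, and the well identities (CTA for address TAG, CGA for address TCG) follow from the reverse complement.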

The upstream subset and the downstream subset of a particular address, taken together, are an “indexed address set”. If an oligonucleotide occurs more than once in a strand, it can occur in both the upstream and the downstream subsets of an address. Indexed address sets provide the information required to order the oligonucleotides contained in a strand set, as will be described below. When a mixture of strands is examined, it is also useful to consider an address set without regard to which oligonucleotides occur upstream and downstream of an address. This is called an “unindexed address set”. Unindexed address sets are decomposable into strand sets by the method of this invention.

FIGS. 8 and 9 depict a situation in which only one strand is analyzed. In this simple case, once the indexed address sets are inferred for every address contained in the parental strand (in this illustration there are 14 address sets), the relative position of each address oligonucleotide within the strand is determined by comparing address sets to each other. For example, the address set for “ATG” has no upstream addresses and thirteen downstream addresses. The address set for “TGA” has one upstream address (ATG) and twelve downstream addresses, etc. It follows that ATG comes in the strand before TGA. In this manner we determine the order of the address oligonucleotides within the parental strand.
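
For the single-strand case, the comparison of address sets reduces to sorting addresses by the size of their upstream subsets. The sketch below assumes that every address occurs exactly once in the strand (true for the FIG. 9 example, not in general) and rebuilds the upstream subsets from the assumed strand ATGAGCCTAGATCGGT only so that the example is self-contained. All names are illustrative.

```python
def upstream_subsets(strand: str, n: int) -> dict:
    """Idealized survey result: address -> oligonucleotides at or 5' of it."""
    upstream = {}
    for i in range(len(strand) - n + 1):
        partial = strand[:i + n]
        found = {partial[j:j + n] for j in range(len(partial) - n + 1)}
        upstream.setdefault(strand[i:i + n], set()).update(found)
    return upstream


def order_addresses(upstream: dict) -> list:
    """Fewer upstream oligonucleotides means closer to the 5' end.

    Valid only when each address occurs once in the strand.
    """
    return sorted(upstream, key=lambda address: len(upstream[address]))


ordered = order_addresses(upstream_subsets("ATGAGCCTAGATCGGT", 3))
# Consecutive addresses overlap by n-1 nucleotides, so each adds one nucleotide.
rebuilt = ordered[0] + "".join(address[-1] for address in ordered[1:])
```

When oligonucleotides recur, this simple ordering breaks down, which is why the specification goes on to work with sequence blocks.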

We have discovered that when assembling big strand sets whose oligonucleotides do not all overlap uniquely, it is advantageous to work with “sequence blocks” rather than with individual oligonucleotides. Sequence blocks are composed of oligonucleotides that uniquely overlap one another in a given strand set. Two oligonucleotides contained in a strand set are said to overlap if they share a terminal (5′ or 3′) n−1 nucleotide sequence. An overlap is unique if no other oligonucleotide than those two in the strand set has this sequence at its termini. Here n is the length (in nucleotides) of each of the two oligonucleotides if they are of the same length or, if they are of different length, n is the length of the shorter one. We use unique overlaps to construct sequence blocks from the oligonucleotides in a strand set.

We can use the strand depicted in FIG. 9 as an illustration. By examining the address sets obtained using partialing and surveying methods (described above and discussed in more detail later), the set of all oligonucleotides in a strand will have been determined. For example, the set of oligonucleotides that occur in the strand shown in FIG. 9 will have been determined to be, in alphabetical order: AGA, AGC, ATC, ATG, CCT, CGG, CTA, GAG, GAT, GCC, GGT, TAG, TCG and TGA. To begin the method of assembling those oligonucleotides into the strand sequence shown in FIG. 9, we use unique overlaps to assemble sequence blocks, as will now be described in conjunction with FIG. 9A.

Because the oligonucleotides in the set are trinucleotides (n=3), n−1 is two. We examine, therefore, the first two nucleotides and the last two nucleotides of each address. Referring to FIG. 9A, the strand set of fourteen trinucleotides is shown first. Then each trinucleotide is shown as a pair of dinucleotides; e.g., AGA is shown as AG and GA. We examine those dinucleotides. If a dinucleotide occurs only twice, it indicates that two oligonucleotides uniquely overlap. The dinucleotide GC occurs only in trinucleotides AGC and GCC, so these two trinucleotides are assembled as shown in FIG. 9A, in the order shown there, AGCC. To see if this block can be enlarged, we examine its 5′-terminal and 3′-terminal dinucleotides. Its 5′-terminal dinucleotide, AG, occurs in four trinucleotides (AGA, AGC, GAG and TAG), therefore it is not a unique overlap. Thus, the block AGCC cannot be extended in the 5′ direction. Its 3′-terminal dinucleotide, CC, in contrast, occurs in only two trinucleotides, GCC and CCT. Therefore, block AGCC can be extended at its 3′ end to form AGCCT. For the same reason, the block can be extended at its 3′ end by inclusion of oligonucleotides CTA and TAG to form AGCCTAG, but further extension of the block at its 3′ end is not possible because of non-uniqueness of overlap AG. Similarly, blocks ATCGGT, ATGA, AGA, GAG, and GAT can be isolated from the rest of the strand set. Note that block ATCGGT cannot be extended at its 3′ end because dinucleotide GT is only present in GGT, and in no other oligonucleotide. This means that, in this particular example, this block is the 3′-terminal block of the strand. Blocks AGA, GAG, and GAT are identical to oligonucleotides in the strand set, because the overlaps they can form (AG, GA, and AT) are not unique overlaps.
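
The block-building walk-through above can be expressed as a short sketch. It assumes that all oligonucleotides in the strand set have the same length, and it treats an overlap as unique when exactly one oligonucleotide ends with the (n−1)-mer and exactly one other begins with it. The function name `sequence_blocks` is illustrative, and blocks that would close into a circle are not handled.

```python
from collections import defaultdict


def sequence_blocks(oligos: set) -> list:
    """Join oligonucleotides of equal length n through unique (n-1)-mer overlaps."""
    n = len(next(iter(oligos)))
    by_prefix, by_suffix = defaultdict(list), defaultdict(list)
    for oligo in sorted(oligos):
        by_prefix[oligo[:n - 1]].append(oligo)
        by_suffix[oligo[1:]].append(oligo)

    # successor[x] = y when the overlap between x (5' side) and y is unique.
    successor = {}
    for overlap, heads in by_suffix.items():
        tails = by_prefix.get(overlap, [])
        if len(heads) == 1 and len(tails) == 1 and heads[0] != tails[0]:
            successor[heads[0]] = tails[0]

    has_predecessor = set(successor.values())
    blocks = []
    for start in sorted(oligos):
        if start in has_predecessor:
            continue  # this oligonucleotide sits inside or at the end of a block
        block, current = start, start
        while current in successor:
            current = successor[current]
            block += current[-1]  # each joined oligonucleotide adds one nucleotide
        blocks.append(block)
    return sorted(blocks)


strand_set = {"AGA", "AGC", "ATC", "ATG", "CCT", "CGG", "CTA",
              "GAG", "GAT", "GCC", "GGT", "TAG", "TCG", "TGA"}
blocks = sequence_blocks(strand_set)
```

For the fourteen trinucleotides of FIG. 9 this yields the six blocks named in the text: AGA, AGCCTAG, ATCGGT, ATGA, GAG, and GAT.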

Whether an oligonucleotide is downstream or upstream of another in a strand set is not considered in the formation of the blocks, but this information is used at the next step, when the sequence blocks are ordered.

FIG. 10 shows a schematic overview of the way in which a nucleic acid sequence is assembled from a strand set. This is done by examining the distribution of oligonucleotides in the upstream and downstream subsets of relevant address sets. A strand set, shown schematically, has sixteen unordered oligonucleotides (FIG. 10 a). They are each identified by a pattern which indicates the particular group of uniquely overlapping oligonucleotides (FIG. 10 b) which can be assembled into a sequence block, as illustrated in FIG. 9A. The individual sequence blocks are schematically represented in FIG. 10 c. Then, the position of each sequence block relative to the others is determined from the distribution of the oligonucleotides between the upstream and downstream subsets of every address (FIG. 10 d). This is accomplished by finding, for each of the blocks, which blocks occur upstream, and which blocks occur downstream, of that block by examining the address sets. The address sets are used in order to generate “block sets.” The block sets are address sets wherein blocks have been substituted for the oligonucleotides that comprise the blocks, including the address oligonucleotide (FIG. 10 e). Once the relative position of the sequence blocks has been determined, they can be assembled into the final sequence. The assembly is governed by the following rules: (1) each of the blocks must be used at least once, (2) the blocks must be assembled into a single sequence, (3) the ends of neighboring blocks must match each other (i.e., overlap by an n−1 nucleotide sequence, see above) and (4) the order of the blocks must be consistent with their positions relative to one another, as ascertained from the block sets.
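
Rule (3) lends itself to a small sketch: once an order for the blocks has been established from the block sets, neighboring blocks are merged only if their ends share an (n−1)-nucleotide overlap. The function `join_blocks` is a hypothetical helper, and the block order used here is the one implied by the FIG. 9 example rather than one computed from block sets.

```python
def join_blocks(ordered_blocks: list, n: int) -> str:
    """Merge blocks, requiring an (n-1)-nucleotide overlap between neighbors."""
    sequence = ordered_blocks[0]
    for block in ordered_blocks[1:]:
        if sequence[-(n - 1):] != block[:n - 1]:
            raise ValueError(f"ends of {sequence!r} and {block!r} do not match")
        sequence += block[n - 1:]  # append only the part beyond the overlap
    return sequence


# Assumed order for the FIG. 9 blocks (n = 3).
assembled = join_blocks(["ATGA", "GAG", "AGCCTAG", "AGA", "GAT", "ATCGGT"], 3)
```

An order that violates the overlap rule raises an error, which mirrors the way a mismatched pair of blocks rules out a candidate arrangement.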

A sequence block can occur either once in a sequence, or more than once, and this we determine by examining the block sets. If a block occurs more than once in a sequence, it will always be contained in both its own upstream and downstream subsets. On the other hand, if a block occurs only once in a sequence, it may or may not be present in its own upstream or downstream subset. But, if a block is absent from either its upstream subset, or from its downstream subset, that block occurs in the strand only once. Therefore, from an examination of the block sets of FIG. 10, it can be seen that three of the blocks occur only once in the strand being sequenced (FIG. 10 f). The relative order of these “unique” blocks can be determined by noting which of them occur in the upstream subset, and which of them occur in the downstream subset, of the others. Once the unique blocks have been ordered relative to each other, the gaps between them are filled with blocks that may be non-unique. However, not every gap can necessarily be filled in with a particular block. There is a range of locations within which each non-unique block (or presumably-non-unique block) can be present. The range for a particular block is determined by noting those blocks that always occur upstream of it, and those blocks that always occur downstream of it. In FIG. 10 g, the range for each of the two potentially non-unique blocks is indicated by brackets. A gap can be filled in if, and only if, there is a block or a combination of blocks, whose outer ends have n−1 nucleotide-long perfect sequence overlaps with the ends of the blocks that form the gap (indicated in FIG. 10 by their having compatible shapes). Because at least two overlaps, each of low probability, must occur simultaneously, it is highly unlikely that more than one block, or one combination of blocks, can fill a gap. If a particular block occurs many times in a strand, it will have to be used to fill every gap it matches. This is why, using the method of the invention, it is possible to establish the sequence of a strand (as shown in FIG. 10 h) without measuring how many times an oligonucleotide occurs in the partials. It is only necessary to determine whether an oligonucleotide is present or not.

We estimate that if the basic length of the variable segments used in a partialing array and a survey array is eight nucleotides, then this method can determine the sequence of strands that are many thousands of nucleotides long. Shorter variable segments can be used to determine the sequences of shorter strands.

While it is not always possible to avoid all ambiguities with this sequencing procedure, it is quite feasible to limit them to a small enough number so that they can be resolved, if desired, with an independent sequencing technique. The most significant source of ambiguities when utilizing our overall method is the presence within the strands of recursive, or monotonous, regions that consist of perfect repeats of identical units comprised of one, two, three, or more nucleotides, such as . . . AAAAAAAAA . . . or . . . ACACACACACAC . . . For example, the sequence 5′-GGTTGACTGACTGACTGACTGACGGTT-3′ contains the tetrameric sequence TGAC repeated five times. The occurrence of such sequences will result in the appearance of sequence blocks possessing self-overlapping termini. If this occurs, it will not be possible to know how many times those blocks are repeated in a particular region of the analyzed strand. The smaller the recurring unit, the shorter is the sequence block and, therefore, the higher is the probability of its occurrence among the analyzed strands. The most difficult case is a homopolymeric region, where the recurring unit consists of one nucleotide. In that case, the length of the self-overlapping sequence block will be equal to the surveyed length. The probability of finding a recursive sequence with a longer recurring unit declines steeply with an increase in the length of the recurring unit. When the surveyed length is eight nucleotides, then almost all the ambiguities will arise from recursive sequences composed of recurring units that are seven nucleotides or less. Fortunately, the shorter the recurring unit, the fewer types there are. For example, there are only four unit types if the unit is a mononucleotide, twelve (4²−4) types if it is a dinucleotide; sixty (4³−4) types if it is a trinucleotide, and so on. It is thus practicable to include in the survey array an additional number of longer oligonucleotides which are complementary to recursive sequences that contain short recurring units. The use of longer probes for resolving recursive regions was suggested by Drmanac et al. for the analysis of arrays made of DNA strands [Drmanac, R., Labat, I., Brukner, I. and Crkvenjakov, R. (1989). Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method, Genomics 4, 114-128]. For a survey array containing all variable octanucleotides, an approximately 1.5-fold increase in the number of oligonucleotides will drastically reduce the number of ambiguities caused by recursions. Any ambiguities that remain will not affect the assembly of the sequence blocks that occur within a strand outside of the recursive region. Consequently, the rest of the sequence will be determined unambiguously. Furthermore, it will be known where strands and partials that contain a particular recursion can be found in the sorting and partialing arrays. Therefore, if desired, the number of repeats in the unresolved recursive region can be determined by analyzing these strands or partials by conventional sequencing techniques.
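
The nature of the ambiguity can be seen in a minimal sketch that compares the octanucleotide content of the example strand with that of a variant carrying one fewer TGAC unit. The variant strand and the helper name `kmers` are illustrative assumptions, and surveying is modeled as recording presence or absence only.

```python
def kmers(seq: str, n: int) -> set:
    """Distinct oligonucleotides of length n found in seq (presence only)."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


five_repeats = "GGT" + "TGAC" * 5 + "GGTT"   # the example sequence in the text
four_repeats = "GGT" + "TGAC" * 4 + "GGTT"   # hypothetical variant, one unit fewer

same_content = kmers(five_repeats, 8) == kmers(four_repeats, 8)
# A probe longer than the whole five-unit repeat (20 nucleotides) tells them apart.
resolved_by_long_probe = kmers(five_repeats, 21) != kmers(four_repeats, 21)
```

Both strands present the same set of octanucleotides, so the repeat count cannot be recovered at a surveyed length of eight, whereas sufficiently long probes distinguish them.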

An important aspect of this invention is the ability to sequence a mixture of strands simultaneously. The invention can be used for the determination of fragment sequences from an entire fragmented and sorted genome.

If one strand is being sequenced, then all the address sets determined from a partialing array will contain the same oligonucleotides that constitute the strand set. The only difference is that some oligonucleotides which are downstream in one set may be upstream in another address set. If a mixture of strands has been partialed on a single partialing array, certain addresses will be shared by more than one parental strand. Their address sets will be composite, containing all of the oligonucleotides from all of the strands in which the address oligonucleotide is present. Addresses that are only found in a particular strand in the mixture, however, will have address sets which only contain oligonucleotides from that strand. Those address sets are identical to the strand set, and each contains the same oligonucleotides. The mixture can contain up to a hundred or so different DNA strands, each of a different length and sequence, as can be obtained with an appropriate sorting array (or set of sorting arrays) and method described above. When a mixture of strands is analyzed on a partialing array, the data obtained by surveying the partials will reflect the diversity of the sequences in the mixture, and will appear to be very complex. However, we have discovered a way to decompose the unindexed address sets obtained by analysis of a strand mixture into their constituent strand sets. Then, as we have described for sequencing a single strand, the oligonucleotides in each of the identified strand sets can be grouped into sequence blocks and the blocks can be ordered from the information contained in the indexed address sets.

FIG. 11, diagram A, shows schematically data from a mixture of strands. For purposes of illustration, the mixture is limited to three strands, although the number of strands is not readily apparent in FIG. 11. The oligonucleotides found in a survey of the partial strands at each address are represented as unfilled bold rectangles and are identified by lower case letters on the top of the diagram. Address oligonucleotides are represented by filled black rectangles and are identified by lower case letters on the side of the diagram. Each horizontal line of rectangles shows which oligonucleotides are present in the upstream subset of the address oligonucleotide shown on that line. Diagram B shows the corresponding downstream subsets inferred from the data shown in diagram A. Each horizontal line of shaded rectangles in this diagram shows which oligonucleotides are present in the downstream subset of the address oligonucleotide shown on that line. Note that the pattern in diagram B can also be obtained from the pattern in diagram A by rotating the pattern in diagram A about the diagonal formed by the address oligonucleotides.

The oligonucleotides that constitute the downstream subset of an address (“first address”) can also be determined directly from the survey data, provided that the mixture of strands applied to a partialing array contains both direct copies and complementary copies of each strand. Such a mixture of strands results from symmetric PCR amplification of strands in a well of a sorting array. In that case, the partial(s) sorted into the well with an address that is complementary to the first address will have been generated from the strands that are complementary copies of the parental strand(s). Their partials are complementary to the downstream portion of strands that are direct copies of the parental strand(s). That downstream portion contains the downstream subset of oligonucleotides missing from the partial(s) at the first address. In other words, oligonucleotides contained in the partials from the complementary address are complementary to the oligonucleotides that constitute the downstream subset of the first address.
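
This complementarity relation can be illustrated with a short sketch. It assumes the strand implied by the FIG. 9 example (ATGAGCCTAGATCGGT), an idealized partial that runs from the 5′ end of a strand through its address, and reverse-complement pairing; `revcomp`, `upstream_subset`, and `downstream_from_complement` are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, n: int) -> set:
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def upstream_subset(strand: str, address: str) -> set:
    """Oligonucleotides in the longest partial that terminates in the address."""
    position = strand.rfind(address)
    if position == -1:
        return set()
    return kmers(strand[:position + len(address)], len(address))


def downstream_from_complement(strand: str, address: str) -> set:
    """Read the downstream subset off the complementary strand's partial."""
    complementary_strand = revcomp(strand)
    surveyed = upstream_subset(complementary_strand, revcomp(address))
    return {revcomp(oligo) for oligo in surveyed}


downstream_tag = downstream_from_complement("ATGAGCCTAGATCGGT", "TAG")
```

The result coincides with the downstream subset of TAG obtained indirectly from the upstream subsets of the other addresses, which is the redundancy the text describes.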

Thus, the information obtainable by surveying the wells of a comprehensive partialing array is highly redundant. In fact, the information is repeated twice if complementary strands are partialed together: essentially the same information is collected from the analysis of complementary addresses. This redundancy can be exploited, for example, to filter out errors that can occur during surveying.

This redundancy also provides one way to reduce the number of wells in a partialing array without losing information that is essential for determining strand sequences. For example, a collection of oligonucleotides in a comprehensive array can be divided into two halves such that the (variable) sequences in one half have complementary counterparts in the other half. A partialing array containing either half can be used for partialing mixtures of complementary copies of strands to obtain comprehensive oligonucleotide information about each strand in the mixture. For this reason, it is not necessary to use comprehensive arrays to obtain the information usable to sequence strands.

The information contained in the upstream and downstream subsets of each address can be combined to form unindexed address sets. Diagram C shows how this information can be obtained by superimposing diagrams A and B. Oligonucleotides present in both the upstream and the downstream subset of the same address will occur at the same position in the superimposed pattern (represented as shaded bold rectangles). Consequently, each horizontal line of rectangles in the resulting pattern (diagram D) shows which oligonucleotides are present in either the upstream or the downstream subset of the address identified by the lower case letter on the side of the diagram. These unindexed address sets are used to identify the strand set of each DNA in the original mixture.

Each parental strand in a DNA mixture binds to many different areas (addresses) in the partialing array. The number of different parental strands that bind to a given address in the array depends on how many of the strands possess the address oligonucleotide. It follows that after partialing the mixture, an occupied address in the array may contain one, and possibly more than one, partial strand generated from each parental strand possessing that address. Accordingly, the upstream subset of an address will contain the address oligonucleotide and all the other oligonucleotides that occur upstream of the address oligonucleotide in every parental strand that binds to that address in the array. Put another way, the upstream subset of an address will be the union of the upstream subsets of each parental strand containing the address oligonucleotide. Similarly, the downstream subset of an address will be the union of the downstream subsets of each parental strand containing the address oligonucleotide. And finally, each unindexed address set (identified by the procedure shown in FIG. 11) will be the union of the strand sets of each parental strand containing the address oligonucleotide. No matter whether an address set is composed of one strand, or is composed of more than one strand, each strand will contribute all of its oligonucleotides to the address set.
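
The union rule for unindexed address sets can be sketched as follows. Oligonucleotides are represented by lower case letters, in the spirit of FIG. 11, but the three strand sets used here are invented for illustration and are not the ones drawn in the figure.

```python
def unindexed_address_sets(strand_sets: list) -> dict:
    """Each address collects every oligonucleotide of every strand that contains it."""
    address_sets = {}
    for strand_set in strand_sets:
        for address in strand_set:
            address_sets.setdefault(address, set()).update(strand_set)
    return address_sets


# Hypothetical mixture of three strands; letters stand for oligonucleotides.
mixture = [set("abcd"), set("cdef"), set("fgh")]
address_sets = unindexed_address_sets(mixture)
```

An address found in only one strand (such as "a") receives exactly that strand set, whereas a shared address (such as "c") receives the union of the strand sets that contain it.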

Unindexed address sets can be either “prime” or “composite.” A prime set consists of one strand set; while a composite set consists of more than one strand set. Accordingly, it is characteristic of a prime set that it cannot be decomposed into other address sets, i.e., there is no address set which is a subset of a prime set. Composite sets, however, can usually be decomposed into two or more simpler address sets.

FIG. 12 illustrates, schematically, how unindexed address sets can be decomposed into constituent strand sets. If a number of different address sets consist of the same strand set, or consist of a particular group of strand sets, then those address sets will be identical. Therefore, for the sake of simplicity, we can sort all the address sets into groups of identical address sets. For example, diagram A in FIG. 12 shows the different groups of identical address sets (I through V) that can be formed from the address sets identified in diagram D of FIG. 11. The address sets in three of these groups (I, II, and III) appear to be prime sets, because these address sets cannot be decomposed into other address sets. The address sets in the other two groups (IV and V) are clearly composite sets: they contain oligonucleotides that constitute two or more prime sets. Thus, group IV includes all oligonucleotides belonging to groups II and III, and group V includes all oligonucleotides that belong to three groups of prime sets (I, II and III).

By using the five groups of address sets, we can build three “pyramids” (FIG. 12, diagrams B, C and D), such that on the top of each there are prime sets (i.e., address sets that do not contain other address sets as their subsets). The rest of a pyramid is comprised of address sets that include the top address sets (i.e., prime sets) as a subset. These common oligonucleotides comprise full columns in the three pyramids, and the oligonucleotides common to each pyramid constitute three strand sets. It can be seen from diagrams B, C, and D of FIG. 12, that the oligonucleotides contained in a strand set are identical to the addresses whose address sets form a pyramid. This is exactly what is expected, since a strand set must contribute all of its oligonucleotides to each pertinent address set.
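
The pyramid construction can be sketched as a search for address sets that contain no other address set, followed by a check that the addresses forming each pyramid coincide with the oligonucleotides at its top. The letters and the three strand sets are invented for illustration; `prime_sets` and `pyramid_addresses` are hypothetical names.

```python
def prime_sets(address_sets: dict) -> list:
    """Address sets that have no other address set as a proper subset."""
    distinct = set(address_sets.values())
    return [p for p in distinct if not any(q < p for q in distinct)]


def pyramid_addresses(address_sets: dict, prime: frozenset) -> frozenset:
    """Addresses whose address sets include the prime set as a subset."""
    return frozenset(a for a, s in address_sets.items() if prime <= s)


# Hypothetical mixture; each address set is the union of the strand sets containing it.
strand_sets = [frozenset("abcd"), frozenset("cdef"), frozenset("fgh")]
address_sets = {
    a: frozenset().union(*(s for s in strand_sets if a in s))
    for a in "abcdefgh"
}

recovered = sorted(
    "".join(sorted(p))
    for p in prime_sets(address_sets)
    if pyramid_addresses(address_sets, p) == p  # addresses match common oligonucleotides
)
```

In this toy mixture the five distinct address sets yield three pyramids, and each pyramid returns one of the original strand sets.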

Specific examples of interpreting the oligonucleotide information obtained by partialing mixtures of strands and by surveying the oligonucleotide content of the partials that are present in the wells of the partialing array are given below (see Examples 6.1 and 6.2).

FIG. 12 illustrates how strand sets can be identified, when each strand set contains at least one oligonucleotide that is not present in any other strand set. The unindexed address set associated with a unique oligonucleotide contains only one strand set, and it is a prime set. However, there can be situations when there is no oligonucleotide in a strand set that is unique, in the sense used above. This is expected to occur frequently when fragments from diploid genomes are examined. Restriction fragments will occur as allelic pairs, and allelic strands will, as a rule, hybridize to the same address in a sorting array. In that case partial strands generated from the mixture of strands present in a single well of a sorting array will originate from pairs of allelic strands. Since allelic nucleotide differences occur roughly once in every thousand nucleotides, the two strand sets will, in general, be identical, except for a few oligonucleotides, and most of the addresses they occupy in the partialing array will also be identical. A similar situation will arise when strands originate from repeated genome regions that contain sequence microheterogeneities.

When there are many other different strands in a sample, there will be a high probability that the oligonucleotides that account for the few differences between quasi-identical strands will not be unique in a mixture of strands. In that case, there will be no prime address set for each of the quasi-identical strands. Even if an oligonucleotide occurs only in the quasi-identical strands, the address set associated with that oligonucleotide will be a composite of the strand sets of the quasi-identical strands. That address set will not be decomposable into other address sets, and it will therefore appear to be a prime set, as shown below.

Such a “pseudo-prime set” is illustrated in FIG. 13. The address sets in group I of diagram A appear to be prime sets, because they cannot be decomposed into other address sets. However, inspection of the list of oligonucleotides contained in the group I address sets shows that not all of them are found among the addresses of the corresponding pyramid (made of the group I and group II address sets). The missed addresses are “b”, “g”, “f”, and “p”. At the same time, the respective address sets from groups III and IV (they are shown in diagram A beneath the dashed line) cannot be included in the pyramid, since address sets “b” and “g” do not contain oligonucleotides “f” and “p”, and address sets “f” and “p” do not contain oligonucleotides “b” and “g”, all of which are present among the group I oligonucleotides. This means that the address sets from group I do not consist of a single strand set (i.e., they are pseudo-prime sets). Pseudo-prime sets can be decomposed into their constituent strand sets by finding (building) pyramids that include some of the missed groups of address sets, and that have the property that the list of the oligonucleotides common to every address set in a pyramid is identical to the list of the pyramid's addresses. The result of such a decomposition is shown in diagrams B and C of FIG. 13. In each of these diagrams, there are oligonucleotides that are common to every address set, and they are seen as complete columns of rectangles. Every one of these oligonucleotides is found among the address oligonucleotides listed on the left side of the diagram. Note that a pyramid that includes both groups III and IV (in addition to groups I and II) would not satisfy the above criterion. In that case, the list of addresses would exceed the list of common oligonucleotides, since oligonucleotides “b”, “g”, “f”, and “p” are not common to all these groups.
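
The decomposition criterion can be tried out on a toy mixture. The sketch below is a brute-force illustration, not the procedure of FIG. 13: it enlarges the pyramid with every combination of missed addresses and keeps those pyramids whose common oligonucleotides equal their addresses. The four strand sets (two quasi-identical strands differing in "x" and "y", plus two unrelated strands) and all function names are assumptions.

```python
from itertools import combinations


def decompose_pseudo_prime(address_sets: dict, candidate: set) -> list:
    """Find pyramids whose common oligonucleotides equal their address lists."""
    core = {a for a, s in address_sets.items() if candidate <= s}
    missed = sorted(candidate - core)  # oligonucleotides absent from the pyramid's addresses
    strand_sets = []
    for size in range(1, len(missed) + 1):
        for extra in combinations(missed, size):
            addresses = core | set(extra)
            common = set.intersection(*(address_sets[a] for a in addresses))
            if common == addresses:
                strand_sets.append(addresses)
    return strand_sets


# Hypothetical mixture: "abcx" and "abcy" are quasi-identical strands.
mixture = [set("abcx"), set("abcy"), set("xmn"), set("ypq")]
address_sets = {}
for strand_set in mixture:
    for address in strand_set:
        address_sets.setdefault(address, set()).update(strand_set)

pseudo_prime = address_sets["a"]  # appears prime, yet holds two strand sets
result = decompose_pseudo_prime(address_sets, pseudo_prime)
```

The pyramid that adds both missed addresses fails the criterion because its address list exceeds its common oligonucleotides, which parallels the remark about including both groups III and IV. The exhaustive search grows exponentially with the number of missed addresses and is meant only to make the criterion concrete.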

Pseudo-prime sets cannot always be detected and decomposed into strands by this procedure. This situation occurs when the oligonucleotides that are unique within a pair of the quasi-identical strands are all present in one other strand in the mixture. This is expected to be a rare situation, but one which may occur when analyzing DNA that is the size of the human genome. It can be diagnosed by the inability of the sequence blocks that are formed from a set that is supposed to be a prime set to be assembled into one contiguous sequence. When this happens, an analysis of the same quasi-identical strands within a different group of strands (obtained from a different well in the sorting array) can be helpful. This well is the well where strands complementary to those being analyzed were originally bound in the strand sorting array. In different wells, the strands from the same fragment will be enmeshed in a different group of strands. These different sequence contexts will interfere differently with the determination of the sequence of the strands and, thus, will often provide a way around the problem.

Once the individual strand sets have been identified, they can each be treated as though they were obtained from an analysis of a homogeneous strand. As was described earlier, the oligonucleotides in the strand set can be assembled into sequence blocks, and the location of these sequence blocks relative to one another can be determined from the presence or absence of these blocks in the upstream and downstream subsets of the relevant addresses. It is thus possible, in many cases, to sequence all the strands in an unknown heterogeneous DNA sample without first isolating them from one another.

In this manner, the complete nucleotide sequence of every strand in a mixture can be determined. Occasional errors in the input data due to the presence of false hybrids on a survey array, or due to missing hybrids, are markedly reduced by the redundancy of having many different partials for each strand, and by the fact that each group of partials is analyzed separately. After each of the groups of sorted fragments has been analyzed by this partialing method, the sequence of almost every restriction fragment in the original digest will be known. Methods to minimize ambiguities in sequencing are discussed later.

The fragment sequences obtained by the methods outlined above or by any other method can then be put in their correct order using oligonucleotide arrays. Assembling restriction fragments into contiguous sequences can be accomplished by identifying each fragment's immediate neighbors. One method for obtaining this information is to use another restriction enzyme to cleave the same DNA at different positions, thus producing a set of fragments that partially overlap neighboring fragments from the first digest, and then to sequence these fragments in order to identify the neighbors. However, it is not necessary to sequence the fragments in the second restriction digest. It is only necessary to uniquely identify overlapping segments in the fragments from alternate restriction digests. This can be accomplished by surveying “signatures”.

Signatures can be determined by hybridization of the fragment strands to complementary oligonucleotide probes. A signature of a fragment may consist of one, two, or more oligonucleotides, so long as it is unique within the DNA sequence being analyzed.

Neighboring fragments from one restriction digest can be determined by looking for their signatures in overlapping fragments from an alternate digest. This principle has been used by others to order an array of cloned fragments immobilized on a solid support. Overlapping fragments have been identified by the “fingerprint” pattern created when a series of short oligodeoxynucleotide probes are hybridized to the fragments [Craig, A. G., Nizetic, D., Hoheisel, J. D., Zehetner, G. and Lehrach, E. (1990). Ordering of Cosmid Clones Covering the Herpes Simplex Virus Type I (HSV-I) Genome: A Test for Fingerprinting by Hybridization, Nucleic Acids Res. 18, 2653-2660]. Overlapping fragments have also been identified by hybridization to groups of end-specific RNA transcripts [Evans, G. A. and Lewis, K. A. (1989). Physical Mapping of Complex Genomes by Cosmid Multiplex Analysis, Proc. Natl. Acad. Sci., U.S.A. 86, 5030-5034]. Both methods require preliminary cloning of the overlapping fragments.

We have devised a new method for identifying neighboring restriction fragments among the list of sequenced fragments that does not require either cloning or sequencing of overlapping fragments. If strands from an alternate digest are sorted, complementary strands of the same fragment will hybridize to different addresses in the sorting array. Whenever intersite segments from two or more fragments of the first digest are present within one fragment of the second digest, then all of these segments will be represented in both complementary strands of that one fragment, and all will be present wherever those strands bind in a sorting array. We identify the segments by obtaining their signatures through hybridization to specialized binary survey arrays. The signatures of intersite segments that occur in one fragment always accompany each other, whereas signatures of distant segments travel independently.

After the fragments from an original (first) restriction digest of a long DNA have been sequenced, the same DNA is digested with a second (different) restriction endonuclease, the termini of the generated fragments are provided with universal priming regions (that also restore the recognition sites at the termini), and the strands are sorted according to particular internal sequences, namely, a variable sequence adjacent to the recognition site for the first restriction enzyme. The sorting array used is a sectioned binary array (see Example 2.1, below). The array contains immobilized oligonucleotides having a variable sequence as well as an adjacent constant sequence that is complementary to the recognition sequence of the first restriction endonuclease. The sorted strands are amplified by “symmetric” PCR, so that in each well where a strand has been bound, copies of the bound strand, as well as their complements, are generated. In another embodiment, strands can be sorted according to their terminal sequences on an array whose oligonucleotides' constant segments include sequences that are complementary to the recognition site of the second restriction enzyme (see Examples 1.1 to 1.3, below). This alternative embodiment for identifying neighboring fragments is not detailed here; it corresponds to the embodiment discussed below, except that terminal sorting is used. Any sorting technique that results in a sufficiently low number of strands in each group can be employed.

Each strand that hybridizes to the binary sorting array will possess at least two recognition sites for the second restriction enzyme (restored at the strand's termini), and at least one (internal) recognition site for the first restriction enzyme. The segments included between these two types of restriction sites (intersite segments) comprise the overlaps between the two types of restriction fragments, and each intersite segment is thus bounded by any two restriction sites of the two types. It follows that each of these segments can be characterized by identifying these two restriction sites and variable sequences of preselected length within the segment that are immediately adjacent to each of the restriction sites. The combination of a recognition site (for either the first or the second restriction enzyme) and its adjacent variable oligonucleotide we call a “signature oligonucleotide”. Every intersite segment can be characterized by two signature oligonucleotides (of either type) that bound that segment. The combination of the two signature oligonucleotides is defined herein as the intersite segment's “signature”.
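The definition of a signature can be made concrete with a minimal Python sketch. It is an illustration only, not the procedure of the invention: it works on a single strand read left to right, treats both recognition sites as plain substrings, and ignores priming regions and strand orientation. The function names (`find_sites`, `intersite_signatures`) and the example enzymes' sites (GAATTC and GGATCC) are assumptions introduced here.

```python
def find_sites(seq, site):
    """Return the start index of every occurrence of `site` in `seq`."""
    hits, i = [], seq.find(site)
    while i != -1:
        hits.append(i)
        i = seq.find(site, i + 1)
    return hits


def intersite_signatures(seq, site_a, site_b, var_len=8):
    """List the signature of every intersite segment in `seq`.

    A signature is a pair of signature oligonucleotides; each one is a
    recognition site joined to the `var_len` nucleotides of the segment
    that lie immediately next to that site.
    """
    sites = sorted(
        [(p, "A", site_a) for p in find_sites(seq, site_a)]
        + [(p, "B", site_b) for p in find_sites(seq, site_b)]
    )
    signatures = []
    for (p1, type1, s1), (p2, type2, s2) in zip(sites, sites[1:]):
        start, end = p1 + len(s1), p2          # segment interior
        if end - start < var_len:
            continue                            # too short to survey (assumed rule)
        left = (type1, s1 + seq[start:start + var_len])
        right = (type2, seq[end - var_len:end] + s2)
        signatures.append((left, right))
    return signatures


example = "TT" + "GAATTC" + "ACGTACGTAC" + "GGATCC" + "TTTTCCCCAA" + "GAATTC" + "GG"
for left, right in intersite_signatures(example, "GAATTC", "GGATCC"):
    print(left, right)
```

With a hexameric site and an octanucleotide variable segment, each signature oligonucleotide in this sketch is 14 nucleotides long, so a signature spans 28 nucleotides.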

After strand amplification, the strands in the wells of the sorting array are surveyed to identify the signature oligonucleotides of each of the two types. This is carried out by using two types of binary survey arrays. The first of these binary survey arrays has immobilized oligonucleotides containing a variable oligonucleotide segment and a constant segment that is, or includes, an adjacent sequence that is complementary to the recognition site for the first restriction endonuclease. The immobilized oligonucleotides in the second of these binary survey arrays have a variable oligonucleotide segment of preferably the same length as the variable segment of the first specialized survey array, and a constant segment that is, or includes, an adjacent sequence that is complementary to the recognition site for the second restriction endonuclease. The constant oligonucleotide segments in these binary survey arrays can be located either upstream or downstream of the variable oligonucleotide segments, resulting in the surveying of either the downstream or the upstream signature oligonucleotides in each strand of the intersite segments being surveyed. In a preferred embodiment the constant oligonucleotide segments are upstream from the variable segments, and the immobilized oligonucleotides have free 3′ ends, so that they can be extended by incubation with a DNA polymerase (see Example 5.1.4, below). In the discussions below, we will assume that this preferred embodiment is used for surveying. From the oligonucleotide information that is obtained, the sequenced fragments can be ordered relative to one another.

The principle of this method is illustrated in FIG. 14. The top diagram shows a region of a double-stranded DNA molecule that contains recognition sites for two different restriction endonucleases (A and B). Each recognition site is adjacent to an upstream oligonucleotide segment of a variable sequence (represented as a shaded square, and identified by a code, in which the first character is the type of restriction site). The sequence of a recognition site, in combination with the sequence of its adjacent oligonucleotide, is responsible for the hybridization of its DNA strand to the oligonucleotide arrays used in this method. Such a combination will be called an “A-type signature oligonucleotide” or a “B-type signature oligonucleotide”. A digest of the DNA with the A-type restriction enzyme contains fragments X and Y. (Assume those fragments have been sequenced.) Digestion of the same DNA region with the B-type restriction enzyme gives rise to a chimeric fragment that contains the right intersite segment of fragment X (i.e., X_(R)) and the left intersite segment of fragment Y (i.e., Y_(L)). After digestion, the terminal recognition sites in the B-type restriction fragments are restored by the introduction of priming regions, and the strands are then melted apart and hybridized to an A-type sorting array.

Each of the immobilized oligonucleotides in the A-type sorting array consists of a sequence complementary to the A-type restriction site and a variable segment. The array is comprehensive as far as variable sequences are concerned, so that every strand is bound in one or more locations in the array. An A-type array, rather than a B-type array, is used to sort B-type restriction fragments in the illustrations of FIGS. 14, 15, and 16. Therefore, the strands bind to the array by their internal regions. The complementary strands of B-type fragment X_(R)Y_(L) will hybridize at two different addresses (i.e., wells) in the sorting array, as shown in the bottom diagram of FIG. 14. When the strands are amplified (in a polymerase chain reaction), each strand gives rise to its complementary copy, restoring each strand of the restriction fragment X_(R)Y_(L) at each of those two addresses.

Our method obtains the signature of every intersite segment. Intersite segments X_(R) (whose signature consists of oligonucleotides B2 and A3) and Y_(L) (whose signature consists of oligonucleotides B3 and A4) are seen together at two different addresses (i.e., wells), A3 and A4, in the sorting array, indicating that segments X_(R) and Y_(L) are present in the same B-type fragment, and therefore neighbor each other in the undigested DNA. In addition to establishing that the two A-type fragments (X and Y) are neighbors, our method determines the orientation of their linkage, i.e., that the right side of fragment X is linked to the left side of fragment Y. This can be determined even if other fragments are present at each of the addresses, because the segments of these other fragments will appear together at different combinations of addresses, i.e., it is highly unlikely that the signatures of other intersite segments from the first well will also appear in the second well where X_(R) and Y_(L) are found.

After the B-type fragments have been sorted into groups on an A-type sorting array as discussed above and shown in FIG. 14, each group is analyzed (surveyed) by hybridization to the two types of binary survey arrays discussed above, A and B. Oligonucleotides of the A-type binary survey array contain in their constant segments a sequence that is complementary to the A-type restriction site, whereas the constant segments in the B-type binary survey array include a sequence that is complementary to the B-type restriction site. Since every intersite segment that occurs in a B-type fragment will be bordered by a pair of restriction sites (each of which can be either A-type or B-type), every segment hybridizes to two different areas in the survey arrays. If the two surveyed signature oligonucleotides in each intersite segment that constitute a signature are each fourteen nucleotides long (6-nucleotide-long restriction site plus an 8-nucleotide-long variable segment), their combined length will be 28 nucleotides. The signature is likely to be unique, even though the variable segment of each probe is rather short. Because the sequence of every A-type fragment is already known, every intersite segment can be identified from its signature, and neighboring fragments from the first digest can be identified.

For example, FIG. 15 shows four previously sequenced fragments (M, N, O and P) produced by digestion of a DNA with restriction endonuclease A. Because the sequence of each A-type fragment is known, we can predict: the sites where these fragments will be cleaved by restriction enzyme B, the addresses in the sorting array where segments of these fragments will hybridize, and the signatures those segments will possess. Some of the fragments contain a restriction site for a second digestion with restriction enzyme B. The intersite segments are M_(L), M_(R), N, O_(L), O_(I), O_(R), P_(L), and P_(R), as shown. (“I” refers to an internal segment.) Some segments are bordered by one A-type restriction site and one B-type restriction site (such as segment M_(L)); some are bordered by two A-type restriction sites (such as segment N); and some are bordered by two B-type restriction sites (such as segment O_(I)). The signature oligonucleotides of each type are found at the 3′ terminus of each strand of an intersite segment. Fragment O possesses two B-type restriction sites. Therefore, its internal segment, O_(I), will not hybridize to the A-type sorting array, because it lacks an A-type restriction site. On the other hand, fragment N lacks B-type restriction sites. Accordingly, it is entirely contained in the intersite segment N whose signature consists of two A-type oligonucleotides. All the segments' signatures will be found at four addresses in the sorting array: A11, A23, A33, and A43.

FIG. 16 shows how the data obtained with the A-type and B-type survey arrays can be utilized to order the A-type fragments shown in FIG. 15. First, for each occupied address in the sorting array, a list of all surveyed oligonucleotides of the A and B type is prepared. From all possible pairwise combinations of these oligonucleotides, only those that are contained in the “key”, as shown in FIG. 16, are chosen, because only those combinations correspond to the already known signatures of real intersite segments. If every signature is unique (i.e., belongs to only one intersite segment), then the segments can be identified unambiguously. By comparing the sets of intersite segments found at different addresses, the intersite segments that occur together at more than one address can be determined. This identifies “companion” segments. Lack of a companion indicates that the segment occupies a terminal position in the DNA. We then use the information obtained to order the fragments, as shown at the bottom of FIG. 16.
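The bookkeeping described above (pair the surveyed oligonucleotides, keep only pairs found in the key, then look for segments that recur together at more than one address) can be sketched as follows. This is a hypothetical illustration: the function names, the dictionary layout of the key, and the decoy segment "Z_L" are assumptions; the labels B2, A3, A4, B3, X_R and Y_L follow the FIG. 14 discussion.

```python
from itertools import combinations


def segments_at_address(surveyed_oligos, key):
    """Return the intersite segments whose two signature oligonucleotides
    were both surveyed at one address of the sorting array."""
    found = set()
    for pair in combinations(sorted(surveyed_oligos), 2):
        segment = key.get(frozenset(pair))
        if segment is not None:
            found.add(segment)
    return found


def companion_segments(address_to_oligos, key):
    """Map each pair of segments seen together at two or more addresses
    to the set of addresses where they co-occur."""
    per_address = {
        address: segments_at_address(oligos, key)
        for address, oligos in address_to_oligos.items()
    }
    companions = {}
    for a1, a2 in combinations(sorted(per_address), 2):
        shared = per_address[a1] & per_address[a2]
        for pair in combinations(sorted(shared), 2):
            companions.setdefault(pair, set()).update((a1, a2))
    return companions


# Key built from the already sequenced A-type fragments (illustrative labels).
key = {
    frozenset(("B2", "A3")): "X_R",
    frozenset(("A4", "B3")): "Y_L",
    frozenset(("B7", "A9")): "Z_L",   # unrelated segment, present in one well only
}
wells = {
    "A3": {"B2", "A3", "A4", "B3", "B7", "A9"},
    "A4": {"B2", "A3", "A4", "B3"},
}
result = companion_segments(wells, key)
print(result)
```

In this toy input, X_R and Y_L are reported as companions because their signatures appear together in both wells, while the unrelated segment is seen in only one well and gains no companion.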

If an A-type fragment is completely embedded in a B-type fragment, so that there are no B-type restriction sites within that fragment (as in fragment N), its position between the neighboring A-type fragments is established, though without regard to its orientation. It is also possible that a B-type fragment will include a number of A-type fragments. In this case, the location of the entire group of fragments between the outer segments of the B-type fragment will be established. However, the orientation of the internal A-type fragments and their position relative to one another will be unknown. We have devised a simple solution to this problem. The fragments from the B-type digest can be re-digested with a restriction enzyme whose recognition site is shorter. For example, if restriction endonucleases with a hexameric recognition sequence were employed to produce the A-type and B-type fragments, a restriction enzyme with a tetrameric recognition sequence would be appropriate. Since tetramers occur in nucleotide sequences 16 times as frequently as hexamers, there would be almost no A-type fragments that lack the tetrameric recognition site within their sequence. After hybridization of the secondary digest to an A-type sorting array, only 1/16 of the original DNA will remain bound. An analysis of the signatures of the bound intersite segments that are bordered (on one side) by the tetrameric recognition site, performed as described above, will allow the fragments in a group to be ordered. In this case, in addition to the A-type survey array, a new binary survey array is used, whose oligonucleotides' constant segments include a sequence that is complementary to the tetrameric restriction site.

The resolving power of this method of identifying neighboring sequenced restriction fragments depends on three probabilistic factors. The first factor is the probability that two distant pairs of neighboring fragments will share the same combination of addresses in the sorting array. The second factor is the probability that the same signature will be shared by two or more segments that occur in the sequenced restriction fragments. If a human genome is digested with restriction endonucleases that have hexameric recognition sites, if the digest is sorted on an array containing variable octanucleotides, and if the A-type and B-type survey arrays also contain immobilized oligonucleotides with a variable octanucleotide sequence, then each of these two probabilities will be quite low except for fragments from highly repetitive regions of the genomic DNA. Most of the uncertainty in ordering fragments will result from a third factor, which is due to the fact that the two oligonucleotides that constitute a signature are determined independently. If fragments from DNA of the size of the human genome are being ordered, the survey data for each well in the sorting array will include, on average, about 22 A-type oligonucleotides and about 22 B-type oligonucleotides, which will result in approximately 750 different pairwise combinations of A:B and A:A types. Some of these combinations will correspond to signatures of intersite segments that actually occur in the genome, but are not present in that well, resulting in the segment being erroneously identified. However, even if this third factor is accounted for, about 99 percent of all neighboring fragments are expected to be identified in one round of the ordering procedure. Analysis of four to five alternate restriction digests, while not required for the invention, will allow virtually all the sequenced fragments to be ordered. Thus, for the human genome, only a few additional arrays would be needed to order all the fragments, and this is several orders of magnitude less expensive and time-consuming than repeating the entire sequencing procedure for each additional restriction digest.

Signatures of fragments could be obtained by other methods, such as by hybridizing each group of fragments to a survey array of oligonucleotides with long variable segments (in such a case, a signature would be defined to be one long oligonucleotide). However, to statistically predict that a signature will be unique in, for example, a human genome, it should be about 30 nucleotides long. If a 28-nucleotide-long signature is chosen, it would result in variable segments 22 nucleotides long that are adjacent to a hexameric restriction site. A survey array containing all possible variable segments of such a length would contain approximately 10¹³ areas. That would be an extremely large array. Our method for obtaining composite (two-membered) signatures is much superior economically.

In our method, the uniqueness of a signature is achieved by surveying “half signatures” (signature oligonucleotides) on two relatively small survey arrays. If the variable segments in those arrays are 8 nucleotides long, the overall number of individual areas in the two arrays is approximately 130,000, or approximately 100,000,000 times smaller than the single array that would be needed for detecting the same size signature (28 nucleotides).
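The array sizes quoted above follow directly from counting all possible variable segments of a given length. The short Python check below reproduces the arithmetic; the variable names are illustrative only.

```python
# Number of areas = number of distinct variable segments = 4 ** length.
single_array_areas = 4 ** 22        # one array with 22-nucleotide variable segments
two_array_areas = 2 * 4 ** 8        # A-type plus B-type arrays, octanucleotide segments

ratio = single_array_areas // two_array_areas

print(f"single long-signature array: {single_array_areas:.2e} areas")
print(f"two half-signature arrays:   {two_array_areas:,} areas")
print(f"size ratio:                  {ratio:,}")
```

The exact values (about 1.8 x 10¹³ areas, 131,072 areas, and a ratio of about 1.3 x 10⁸) agree with the rounded figures in the text.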

Instead of surveying signature oligonucleotides, the intersite segments can also be identified in the wells of the sorting array by comprehensive surveys of all oligonucleotides that are contained in the strands sorted into that well. For example, comprehensive survey arrays similar to those described herein for surveying partials could be employed. The oligonucleotide pattern in each well of the sorting array would very likely be different and, since the oligonucleotide content of each intersite segment is known (because their sequences are known), one could try to decompose the oligonucleotide patterns into individual oligonucleotide sets of the intersite segments. However, inasmuch as the oligonucleotide patterns would be very complex, and the number of intersite segments is very large (more than a million if the restriction sites are hexameric), it would be a very difficult task. At the same time, comprehensive surveys of the oligonucleotides that are contained in the strands sorted into wells can be useful for resolving ambiguities that remain after analysis with arrays that identify signature oligonucleotides, especially for resolution of the ambiguities caused by the second and the third probabilistic factors discussed above. Since most of the intersite segments in a well of the sorting array will have been identified unambiguously, only a few alternative solutions need to be assessed to determine the remaining intersite segments. For this purpose the actual oligonucleotide pattern observed in the well can be compared with a simulated pattern obtained by combining the oligonucleotides in the known intersite segments with the oligonucleotides in the remaining alternative intersite segments.

If a diploid genome (such as a human genome) is sequenced, the ordered fragments will appear as a string of unlinked pairs of allelic fragments. What remains unknown is how the allelic fragments in each pair are distributed between the homologous (sister) chromosomes that came from each parent. Allocation of the allelic fragments to these “chromosomal linkage groups” requires knowledge of which fragment in each pair is linked to which fragment in a neighboring pair.

We have developed a method that uses oligonucleotide arrays for allocating allelic fragments to chromosomes, irrespective of what method was used for sequencing and ordering the fragments. The linkage of fragments in neighboring pairs can be achieved by sequencing a restriction fragment (“spanning fragment”) from an alternate digest that spans at least one allelic difference in each of the pairs. However, since the sequences of the allelic fragments are known, there is no need to sequence the spanning fragment. Instead, one can simply determine which oligonucleotides that harbor allelic differences in neighboring pairs of fragments accompany one another in the spanning fragment, i.e., which oligonucleotides occur in the same chromosome. This can be accomplished by surveying, at a selected address in a partialing array, partials generated from a selected group of restriction fragments from an alternate digest. A group of restriction fragments is selected that contains a spanning fragment, and an address in a partialing array is selected that encompasses a difference in one of the neighboring allelic pairs.

The top diagram in FIG. 17 shows a string of unlinked pairs of allelic fragments, whose order has been determined. The position in each pair of fragments where an allelic difference occurs is indicated by dissimilar symbols. Since the sequence of every fragment is known, it is possible to choose an alternate restriction fragment that spans the allelic differences in the neighboring pairs. A spanning restriction fragment, in fact, may already be present at a particular address in one of the sorting arrays used to sort alternate digests during the ordering procedure. The aim of the procedure, as illustrated in the figure, is to ascertain whether the allelic difference represented by a cross or a triangle occurs within the same spanning fragment as the allelic difference represented by a diamond or a circle. In the figure, the allelic difference represented by a diamond or a circle was arbitrarily chosen to serve as a reference point, with the allocation of the other pair of allelic differences being unknown.

The sorted strands are melted apart, and the mixture is hybridized to a particular well in the partialing array, whose address corresponds to an oligonucleotide that encompasses the reference point. In this illustration, two different wells are selected, each with an address that corresponds to an oligonucleotide that harbors the allelic difference represented by the circle or the diamond. Also, for this illustration the method of generating partials directly on a sectioned array is used (see Example 3.3, below). As discussed above, other methods of preparing partials could be used. After amplification of the partial strands, the oligonucleotides in the two wells are identified with a survey array. It can be seen from an examination of the survey arrays schematically depicted at the bottom of the figure that the oligonucleotides that encompass the allelic difference represented by a circle are accompanied by the oligonucleotides that encompass the allelic difference represented by a cross, while the oligonucleotides that encompass the allelic difference represented by a diamond are accompanied by the oligonucleotides that encompass the allelic difference represented by a triangle. We thus determine that the fragments containing the marker nucleotides represented by the diamond and the triangle are located on one chromosome, whereas the fragments containing the marker nucleotides represented by the circle and the cross are located on the other chromosome.

To allocate allelic pairs to chromosomal linkage groups, it may only be necessary to survey one oligonucleotide encompassing an allelic difference. The particular oligonucleotide that should be surveyed can be determined, if desired, by analyzing the known sequences of the partials in the mixture surveyed. Similarly, it may only be necessary to survey at one address in the partialing array. Having redundant data, however, is preferable, in order to avoid errors that can otherwise arise.

Since allelic differences occur roughly once every 1,000 basepairs in the human genome, most allelic fragments resulting from digestion with a restriction enzyme recognizing a hexameric sequence (resulting in fragments with an average length of about 4,096 basepairs) will differ from each other. If the variable oligonucleotide segments in the survey arrays are made of octanucleotides, then each allelic nucleotide substitution will give rise to eight different oligonucleotides in each of the allelic fragments. However, using our method, inspection of only one address in the partialing array is sufficient to reveal the linkage of the corresponding reference oligonucleotide to any one of the eight oligonucleotides that encompass the nucleotide substitution that occurs in the neighboring fragment on the same chromosome. Therefore, only one address in the partialing array is needed to reveal the linkages between every two neighboring allelic pairs. Thus, 65,536 linkages can be determined on a single comprehensive partialing array made of variable octanucleotides. With this method, only 10 to 20 of these arrays would be needed to complete the assembly of an entire diploid human genome that has been fragmented by a restriction endonuclease with a hexameric recognition site.
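The counts in the preceding paragraph can be checked with a short Python sketch. The helper `oligos_covering` is a hypothetical name, and the haploid genome size of 3 x 10⁹ basepairs is an assumption that the text does not state; the other numbers come from the paragraph above.

```python
import math


def oligos_covering(seq, pos, n=8):
    """Return every n-nucleotide window of `seq` that includes index `pos`."""
    first = max(0, pos - n + 1)
    last = min(pos, len(seq) - n)
    return [seq[i:i + n] for i in range(first, last + 1)]


# A substitution in the interior of a strand is covered by n windows.
strand = "ACGT" * 10
covering = oligos_covering(strand, pos=20, n=8)

# Rough estimate of the number of partialing arrays needed.
GENOME_BP = 3_000_000_000           # assumed haploid human genome size
mean_fragment_bp = 4 ** 6           # hexameric site: one cut per 4,096 bp on average
addresses_per_array = 4 ** 8        # one address (linkage) per octanucleotide
neighboring_pairs = GENOME_BP / mean_fragment_bp
arrays_needed = math.ceil(neighboring_pairs / addresses_per_array)

print(len(covering), round(neighboring_pairs), arrays_needed)
```

Under these assumptions the estimate falls within the 10 to 20 arrays mentioned in the text.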

A power of the sequencing method of this invention is that the high redundancy in the information obtained allows the original hybridization data to be refined by computer analysis, thereby ensuring the reliability of the final results.

In a preferred embodiment described in detail above and in the examples, complementary strands of each DNA fragment of a first restriction digest bind at two addresses in a terminal sequence sorting array, each according to the identity of its 3′-terminal sequence. Subsequent amplification results in both complementary strands being present at both addresses. However, the complementary strands will be enmeshed in a different group of strands at these two addresses. The mixture of strands at each address (area of the sorting array) is separately sequenced by generating complete sets of partials for each, and separately surveying the oligonucleotide content of each well in the partialing array, as described. The different sequence contexts (in the two addresses of the sorting array) will interfere differently with the two strands' sequence determinations, allowing the exclusion of many ambiguities by comparing the information obtained at the two addresses. Furthermore, each of the complementary strands of the same restriction fragment will be sequenced independently within the two groups that the strands of the restriction fragment are sorted into. Because complementary sequences can be derived from each other, the complete set of data for each fragment will be independently collected four times during the entire procedure. Also, the data collected from complementary strands by our method provide an additional opportunity to discriminate against mismatched hybrids. In contrast to perfect matches, mismatches produced by the complementary strands with the immobilized oligonucleotides will result in different hybrid stabilities. For example, the relatively high stability of a G:T mismatch potentially produced by one strand contrasts with the lower stability of the C:A mismatch that can potentially be formed by the corresponding region of the complementary strand.

When strand sets are identified, all the oligonucleotides of each strand should occur in every pertinent address set. Thus, every strand set (or every pair of quasi-identical strand sets) will be determined as many times as the number of different oligonucleotides it contains.

If a strand set is determined incorrectly (i.e., if some oligonucleotides were missed or some were erroneously included), there will be unfilled gaps in the reconstructed sequence, or some blocks will occur that cannot be accommodated within any gap, thus indicating an error. And finally, with the method described herein of preparing strands by restriction digestion and end extension with priming regions, each strand will possess known restriction sequence tags at its ends (and only at the ends). This means that if a sequence does not begin with, or is not terminated by, those tag sequences, it is not a correct sequence and has resulted from errors in the data. Subsequent analysis can pinpoint possible reasons for the errors, and can provide the additional information needed to correct them.

Thus, there are many possibilities using the arrays and methods of this invention to filter out experimental errors that arise due to imperfections in the hybridization procedure. A basic feature of sequencing by hybridization is that every nucleotide position is reflected in n different independently identified oligonucleotides (where n is the length of the oligonucleotide probe). This makes it unlikely that a nucleotide in a sequence will be incorrectly deleted, inserted, or misidentified. In any case, a sequence error will not be overlooked, and all ambiguities that remain in a sequence can be identified and localized. Furthermore, most of the remaining ambiguities can be resolved when each sequence is verified by comparing it to the other version of that sequence that is found at another area in the strand sorting array, and by comparing it to the two versions of its complementary sequence.

Our methods for handling and manipulating the oligonucleotide information obtainable with our arrays and methods can easily be converted into the form of computer algorithms by well-known techniques. Moreover, preliminary computer simulations can be used to further improve sequencing with particular embodiments. For these simulations, a number of different types of nucleotide sequences can be used as input. Natural sequences that are present in the GenBank library can be employed. Random sequences can also be constructed so that they resemble the human genome. Some of the characteristics that could be predetermined are nucleotide composition, dinucleotide frequency, frequency of restriction sites, the presence of telomeres and centromeres, and the presence of repeated segments.

A sequence (along with its complementary copy) can be algorithmically “digested” with a restriction enzyme, the ends provided with terminal priming regions, and the strands sorted into groups according to the identity of their terminal sequences. For the mixture of strands in a group, all possible one-sided partials can be generated, and then sorted according to the identity of oligonucleotide segments at their variable termini (addresses). For every address, a complete list can be prepared of the oligonucleotides that are present in the partials at that address. These upstream subsets can be used to generate the downstream subsets and the address sets of each address. The unindexed address sets can then be decomposed into strand sets, the sequence blocks from each strand set can be formed, and the order of the blocks can be established from their distribution among the upstream and downstream subsets of each address. After the sequences of the fragments in each of the groups have been determined, the sequences can be analyzed to identify restriction sites for those restriction endonucleases that are likely to be most useful in determining the order of the fragments. Collections of signature oligonucleotides can be generated that would occur at each address when fragments from alternate digests are sorted on an array. The distribution of signature oligonucleotides among the addresses in the sorting array can then be analyzed to order the sequenced fragments. A program that uses methods of analysis such as those described herein to determine nucleotide sequences (or a program that uses other methods of analysis) can be tested by comparing the assembled sequences to the input sequences.
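The first two steps of such a simulation (algorithmic digestion and sorting of both strands by their terminal sequences) might look like the Python sketch below. It is a simplified illustration under stated assumptions: priming regions are omitted, each fragment simply keeps a full recognition site at both ends to stand in for the restored termini, and the address of a strand is taken to be the variable segment adjacent to its 3′-terminal site. All function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the complementary strand, written 5' to 3'."""
    return strand.translate(COMPLEMENT)[::-1]


def digest_with_restored_sites(seq, site):
    """Cut `seq` at every `site`; each fragment runs from one site through
    the next, so both termini carry a complete recognition sequence."""
    starts, i = [], seq.find(site)
    while i != -1:
        starts.append(i)
        i = seq.find(site, i + 1)
    return [seq[a:b + len(site)] for a, b in zip(starts, starts[1:])]


def sort_by_3prime_terminus(fragments, site, var_len=8):
    """Group both strands of every fragment by the variable segment that
    precedes the 3'-terminal recognition site (the address)."""
    groups = defaultdict(list)
    for fragment in fragments:
        for strand in (fragment, reverse_complement(fragment)):
            address = strand[-(len(site) + var_len):-len(site)]
            groups[address].append(strand)
    return dict(groups)


genome = "AA" + "GAATTC" + "ACGTACGTAC" + "GAATTC" + "TT"
fragments = digest_with_restored_sites(genome, "GAATTC")
groups = sort_by_3prime_terminus(fragments, "GAATTC", var_len=4)
print(groups)
```

In this toy example the two complementary strands of the single fragment land at two different addresses, as described for the terminal sequence sorting array. Fragments shorter than the site plus the variable segment are not handled by this sketch.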

To further develop useful methods of computer analysis, the mock “haploid genomes”, represented by the input sequences, can be converted into “diploid genomes”, by introducing random nucleotide substitutions into a copy of each of the original DNAs. Furthermore, insertions, deletions, inversions, transpositions, and recombinations can be introduced, in order to simulate the picture that is observed in a real genome. These diploid genomes can be analyzed as described above. After the “allelic pairs” are ordered, the fragments can be assembled into their original “chromosomes” from an analysis of the oligonucleotides that are present in selected partials from alternate restriction “digests.” The results of these simulated sequence determinations can be analyzed, in order to improve the methods of analysis, and to find ways of reducing the number of ambiguities by purely algorithmic means.
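Converting a mock haploid sequence into a diploid pair by random substitution can be sketched in a few lines of Python. The function name and the default substitution rate (one per 1,000 positions, echoing the allelic difference frequency cited earlier) are illustrative assumptions; insertions, deletions, and rearrangements are not modeled here.

```python
import random


def make_allelic_copy(seq, rate=0.001, rng=None):
    """Return a copy of `seq` with random single-nucleotide substitutions,
    plus the list of positions that were changed."""
    rng = rng or random.Random()
    bases = list(seq)
    changed = []
    for i, base in enumerate(bases):
        if rng.random() < rate:
            # Always substitute a different nucleotide.
            bases[i] = rng.choice([b for b in "ACGT" if b != base])
            changed.append(i)
    return "".join(bases), changed


haploid = "ACGT" * 500
allele, positions = make_allelic_copy(haploid, rate=0.01, rng=random.Random(0))
diploid = (haploid, allele)
print(len(positions), "substitutions introduced")
```

Passing a seeded `random.Random` instance makes a simulation run reproducible, which is convenient when comparing assembled sequences against the input.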

The frequencies with which different types of ambiguities occur (when determining fragment sequences, and when linking fragments) can be assessed as a function of the sizes of the oligonucleotides used in the arrays. Simulations can be carried out in which the length of the variable segment within the immobilized oligonucleotides in each type of array is varied, in order to ascertain the combination of array sizes that is optimal (that is, to determine which combination of array sizes is likely to result in the lowest frequency of ambiguities, keeping in mind the need to minimize the time and expense of carrying out a total sequence analysis). Similarly, the effect of average fragment length (which depends on which restriction enzyme is used to cleave the nucleic acid(s) being analyzed) can be assessed.

Computational methods can be developed to minimize or eliminate errors that occur during partialing and surveying, by taking advantage of the high redundancy in the data. Such methods should take into account the following aspects of a preferred sequencing procedure: the sequence of every fragment is independently determined four times (by virtue of each strand and its complement being present at two different addresses in the sorting array); each strand set is determined in as many trials as the number of different oligonucleotides in that strand; every nucleotide in a strand is represented by as many different oligonucleotides as the length (of the variable segment) of the immobilized oligonucleotides in the survey array; the locations where a particular block can occur in a sequence are limited by the distribution of the blocks among the upstream and downstream subsets of each pertinent address; and the edges of a block must be compatible with the edges of each gap where that block is inserted. The following sources of error can be considered:

(1) Errors resulting from signal differences due to the different multiplicities of the oligonucleotides in the sample.

A threshold limit can be applied, thus excluding some rare oligonucleotides from the data. This altered data can then be offered to a sequence reconstruction program, in order to evaluate the tolerance of the method of analysis for the presence of those errors in the data. The outcome of these simulations can be used to predict the maximal DNA length and the maximal number of strands that can be present in a mixture, and still allow unambiguous sequence determination.
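One simple way to simulate such a threshold is to count how many times each oligonucleotide occurs in the mixture and to discard those below a cutoff. The Python sketch below assumes that signal strength is proportional to multiplicity, which is a simplification introduced here; the function names are hypothetical.

```python
from collections import Counter


def oligo_multiplicities(strands, n=8):
    """Count every n-nucleotide window across all strands in a mixture."""
    counts = Counter()
    for strand in strands:
        for i in range(len(strand) - n + 1):
            counts[strand[i:i + n]] += 1
    return counts


def detected_oligos(counts, threshold):
    """Keep only oligonucleotides whose multiplicity reaches the threshold."""
    return {oligo for oligo, count in counts.items() if count >= threshold}


mixture = ["ACGTACGT", "ACGTACGA"]
counts = oligo_multiplicities(mixture, n=4)
observed = detected_oligos(counts, threshold=2)
print(sorted(observed))
```

The filtered set can then be passed to a reconstruction routine to see how much data loss the analysis tolerates.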

(2) Errors resulting from the presence of strong secondary structures in the strands.

Hairpin formation within a strand can compete with the formation of a hybrid, if undegraded partials are applied to a survey array. In order to simulate this situation, regions within an input sequence should be identified that have the potential to form such a secondary structure, and the signal strength of the corresponding hybrids should be reduced accordingly. This will result in the disappearance of some oligonucleotides from the input data, depending on their involvement in highly stable hairpins and on the relative content of those oligonucleotides in the strands. A sequence might be reconstructed, even if a set of overlapping oligonucleotides is missing from the data. The idea is to use the partialing information that can be obtained from complementary strands; in these strands, the gaps will occur on different sides of a hairpin.
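As a rough illustration of how regions with hairpin potential might be located in an input sequence, the following Python sketch searches for inverted repeats separated by a short loop. The stem length and loop limits are arbitrary assumptions, and no thermodynamic stability is computed; a realistic simulation would need a proper folding model.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def hairpin_stems(seq, stem=4, min_loop=3, max_loop=8):
    """Return (left_start, right_start) pairs where a stem-long segment is
    followed, after a loop of min_loop..max_loop nucleotides, by its
    reverse complement. Parameter values are illustrative assumptions."""
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == revcomp(left):
                hits.append((i, j))
    return hits

if __name__ == "__main__":
    print(hairpin_stems("GGGCAAAAGCCC"))
```

Oligonucleotides that overlap a reported stem could then have their simulated signal reduced before the data are offered to a reconstruction program.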

(3) Errors resulting from false signals due to the presence of mismatched hybrids.

As was discussed above, related regions of complementary strands will give rise to mismatched hybrids with different stabilities, because a G:T mismatch is stronger than its C:A counterpart. A comparison of the sets of data obtained from each complementary strand can be used to distinguish between perfect and mismatched hybrids.

(4) Random errors.

Simulations can be carried out in which some data are randomly deleted from oligonucleotide lists, and false data are randomly inserted, in order to assess the ability of the method of analysis to tolerate random errors.
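A minimal Python sketch of such a perturbation is given below. The function name and its parameters are hypothetical; the sketch assumes that the requested number of false oligonucleotides is small compared with the 4^k possible sequences, since otherwise the insertion loop could not finish.

```python
import random

def perturb(oligos, k, n_delete, n_insert, seed=0):
    """Randomly delete n_delete true oligonucleotides and insert n_insert
    false ones of length k. Returns (noisy_set, deleted, inserted)."""
    rng = random.Random(seed)
    oligos = set(oligos)
    # Sort before sampling so the result is reproducible for a given seed.
    deleted = set(rng.sample(sorted(oligos), n_delete))
    inserted = set()
    while len(inserted) < n_insert:
        candidate = "".join(rng.choice("ACGT") for _ in range(k))
        if candidate not in oligos:  # a false signal must not be a true oligo
            inserted.add(candidate)
    return (oligos - deleted) | inserted, deleted, inserted

if __name__ == "__main__":
    noisy, gone, false_hits = perturb({"ACG", "CGT", "GTA", "TAC"}, 3, 1, 2)
    print(sorted(noisy), sorted(gone), sorted(false_hits))
```

Feeding the noisy set to a reconstruction routine for increasing values of `n_delete` and `n_insert` indicates how much random error the analysis can absorb.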

The goal of all such simulations is to select the optimal size for the oligonucleotides used in the different types of arrays. This information can also be used to predict the ratio of signal to noise that must be achieved in the hybridization procedures in each particular case.

Once optimal parameters for the various steps are established, further improvements can be achieved by the economical use of the space available in the arrays. For example, a preliminary survey of the signature oligonucleotides that are present at each address in the first sorting array will indicate which groups of strands can be mixed together before analyzing them on a partialing array, without interfering with sequence determination. This can markedly reduce the number of partialing and survey arrays that are needed. In addition, the distribution of restriction sites within the sequenced fragments can be analyzed in order to select those restriction enzymes that will provide the most useful information for ordering the fragments. The sequenced fragments can also be analyzed to identify, for every two neighboring allelic pairs, a group of restriction fragments that contains a fragment that spans the allelic differences, and whose other fragments will not interfere with the identification of the oligonucleotides that encompass the allelic differences.

Our genome sequencing method relies throughout on essentially the same technology, i.e., hybridization of oligonucleotide probes and the amplification of nucleic acids by the polymerase chain reaction, both of which are well-studied, common laboratory techniques. The entire procedure can be performed by a specially designed machine, resulting in huge reductions in time and cost, and a marked improvement in the reliability of the data. Many arrays could be processed simultaneously on such a machine. The machine most preferably should be entirely computer-controlled, and the computer should constantly analyze intermediate results. As stated above, used arrays can be stored, both to serve as a permanent record of the results, and to provide additional material for subsequent analysis or for manipulating the sequenced strands and partials.

The route followed by each fragment through the described series of arrays is uniquely determined by its particular sequence. By discerning the path that each fragment takes, a computer associated with the machine can accurately reconstruct the sequence of a subject genome.

The result of the analysis of an individual's genomic DNA, using the method described above, is the complete nucleotide sequence of that individual's diploid genome. The genes, and their control elements, would be allocated into chromosomal linkage groups, as they appear in a single living organism. The sequence will thus describe an intact, functioning ensemble of genetic elements. This complete genome sequencing provides the ability to compare the genomes of many individuals, thereby enabling biologists to understand how genes function together, and to determine the basis of health and disease. The genome of any species, whether haploid or diploid, can be sequenced.

The invention can be used not only for sequencing DNAs but also for sequencing mixtures of cellular RNAs.

The invention is also useful to determine sequences in a clinical setting, such as for the diagnosis of genetic conditions.

VI. Examples

1. Sorting Nucleic Acids or their Fragments on a Binary Oligonucleotide Array Whose Immobilized Oligonucleotides have Free 3′ Termini, with their Constant Segments Located Upstream of the Variable Segments

This method allows the immobilized oligonucleotides on the binary array to serve as primers for copying bound DNA or RNA strands, resulting in the formation of their complementary copies covalently linked to the surface of the array. In such an embodiment, the array can be vigorously washed after the extension of the immobilized oligonucleotides to remove any non-covalently bound material. Moreover, these arrays containing covalently bound strands can be stored and used as a permanent library from which additional copies of the sorted strands can be generated. If amplification of the sorted strands on the binary array is desired, the array can be sectioned. For example, strands can be sorted on a plain (unsectioned) binary array, and the array can be sectioned at a later date. Sorting need not be carried out on sectioned arrays. If amplification is not required using the methods of the examples, then sectioned arrays may not be necessary.

1.1. Sorting Restriction Fragments According to their Terminal Sequences, Following the Introduction of Terminal Priming Regions—

DNA is digested using a restriction endonuclease. Recognition sites for the restriction endonuclease are restored in solution by introducing terminal extensions (adaptors) that contain a sequence which, together with the restored restriction site, forms a universal priming region at the 3′ terminus of every strand in the digest. This priming region is later used for amplification of the sorted strands by PCR. After melting fragment strands apart, the strands are sorted on a sectioned binary array. A sequence complementary to the generated priming region serves as both the constant segment of the oligonucleotides immobilized in the sectioned binary array, and as the primer for PCR amplification of the bound strands.

The sequence of the primer (as well as the priming region) is chosen in such a manner that it is well suited for PCR. The criteria for selecting good primers are discussed in detail by Sambrook et al. (1989), Erlich et al. (1991), and Wu, D. Y., Ugozzoli, L., Pal, B. K., Qian, J. and Wallace, R. B. (1991). The Effect of Temperature and Oligonucleotide Primer Length on the Specificity and Efficiency of Amplification by the Polymerase Chain Reaction, DNA Cell Biol. 10, 233-238. Briefly, the primers should be long enough (preferably 15-25 nucleotides) to be able to hybridize to a DNA strand at a temperature that is optimal for polymerization. The primer should not be self-complementary, to avoid the formation of an internal secondary structure within the primer molecule, and to avoid the formation of a duplex between two primer molecules.
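The two criteria just summarized (length and absence of self-complementarity) can be screened for with a simple Python sketch. The length limits come from the text; the cutoff of four nucleotides for tolerated self-complementarity is an arbitrary assumption made for illustration, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_self_complement(primer):
    """Length of the longest segment of the primer whose reverse complement
    also occurs in the primer (a potential hairpin or primer-dimer stem)."""
    best = 0
    n = len(primer)
    for size in range(1, n + 1):
        found = any(revcomp(primer[i:i + size]) in primer
                    for i in range(n - size + 1))
        if not found:
            break  # no stem of this size implies none of any larger size
        best = size
    return best

def primer_ok(primer, min_len=15, max_len=25, max_self=4):
    """Screen a candidate primer. max_self is an assumed cutoff, not a
    value taken from the cited references."""
    return (min_len <= len(primer) <= max_len
            and longest_self_complement(primer) <= max_self)

if __name__ == "__main__":
    print(primer_ok("AAAAAAGAATTCAAAAAA"))
```

A candidate that embeds a palindromic restriction site, such as GAATTC, is flagged by this screen, which suggests that the cutoff would need to be chosen with the restored restriction site in mind.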

It is preferable that all recognition sites of the endonuclease used for DNA digestion be eliminated from the fragments' internal regions during digestion. This further ensures that the fragments' strands are bound to the sorting array only by their terminal regions, and that PCR is always primed only at the strand ends, resulting in amplification of only full-sized copies of the strands.

Naturally occurring modification of some bases in DNA often inhibits DNA cleavage at modified sites. In higher vertebrates, including human beings, cytosine residues are believed to be the only bases that are modified (methylated), producing 5-methylcytosine [Doerfler, W. (1983). DNA Methylation and Gene Activity, Annu. Rev. Biochem. 52, 93-124], with modification occurring mainly within the CG dinucleotide [Cooper, D. N. (1983). Eukaryotic DNA Methylation, Human Genetics 64, 315-333]. Sites containing 5-methylcytosine are not cleaved by most restriction endonucleases [Kessler, C. and Höltke, H. J. (1986). Specificity of Restriction Endonucleases and Methylases—A Review, Gene 47, 1-153]. Complete DNA digestion can be achieved in higher vertebrates either by DNA demethylation prior to the digestion [Gjerset, R. A. and Martin, D. W., Jr. (1982). Presence of a DNA Demethylating Activity in the Nucleus of Murine Erythroleukemic Cells, J. Biol. Chem. 257, 8581-8583], by using restriction endonucleases whose recognition sites do not contain cytosine [such as Aha III/Dra I (site TTTAAA) or Ssp I (site AATATT)], or by using restriction endonucleases whose activity is not influenced by cytosine methylation [such as Taq I (site TCGA), Kpn I (site GGTACC), or Hpa I (site GTTAAC)]. Such restriction endonucleases are known in the art, and many are reviewed by Kessler and Höltke (1986), supra.
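The first of these sequence-based criteria can be checked mechanically. The Python sketch below tests whether a recognition site is free of cytosine on both strands, which for a double-stranded site means that the written strand contains neither C nor G. The site table simply repeats the sites named above; whether an enzyme is insensitive to methylation cannot be read from its site and is not modeled here.

```python
# Recognition sites as given in the text (one strand, 5' to 3').
SITES = {
    "Dra I": "TTTAAA",
    "Ssp I": "AATATT",
    "Taq I": "TCGA",
    "Kpn I": "GGTACC",
    "Hpa I": "GTTAAC",
}

def lacks_cytosine(site):
    """True if neither strand of the site contains cytosine.
    A G on the written strand pairs with a C on the complementary strand."""
    return not (set(site) & {"C", "G"})

if __name__ == "__main__":
    print([name for name, site in SITES.items() if lacks_cytosine(site)])
```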

1.1.1. Method in which a Priming Region is Introduced by Fragment Ligation to Double-Stranded Synthetic Oligodeoxyribonucleotide Adaptors—

DNA to be analyzed is first digested substantially completely with a chosen restriction endonuclease, and the fragments obtained are then ligated to synthetic double-stranded oligonucleotide adaptors essentially as described by Sambrook et al. (1989), supra, and also by Kinzler and Vogelstein [Kinzler, K. W. and Vogelstein, B. (1989). Whole Genome PCR: Application to the Identification of Sequences Bound by Gene Regulatory Proteins, Nucleic Acids Res. 17, 3645-3653]. The adaptors have one end that is compatible with the fragment termini. The other end is not compatible with the fragments' termini. The adaptors can therefore be ligated to the fragments in only one orientation. Means for making compatible and incompatible ends are well known to one skilled in the art.

The adaptors' strands are non-phosphorylated, as results from conventional oligonucleotide synthesis [Horvath, S. J., Firca, J. R., Hunkapiller, T., Hunkapiller, M. W. and Hood, L. (1987). An Automated DNA Synthesizer Employing Deoxynucleoside 3′-Phosphoramidites, Methods Enzymol. 154, 314-326], which prevents their self-ligation. The strands in the restriction fragments have their 5′ termini phosphorylated, which results from their cleavage by a restriction endonuclease. This favors the ligation of the adaptors by a DNA ligase (such as the DNA ligase of T4 bacteriophage) to the restriction fragments, rather than to each other. Since DNA ligase catalyzes the formation of a phosphodiester bond between adjacent 3′ hydroxyl and phosphorylated 5′ termini in a double-stranded DNA, the phosphorylated 5′ termini of the fragments are ligated to the adaptor strand whose 3′ end is at the compatible side of the adaptor. The 3′ termini of the fragments remain unligated. A DNA polymerase possessing a 5′-3′ exonuclease activity (such as DNA polymerase I from Escherichia coli or Taq DNA polymerase from Thermus aquaticus) is then used to extend the 3′ ends of the fragments, utilizing the ligated oligonucleotide as a template, concomitant with displacement of the unligated oligonucleotide. To make the ligated oligonucleotide resistant to the 5′-3′ exonuclease, the ligated oligonucleotide can be synthesized from α-phosphorothioate precursors [Eckstein, F. (1985). Nucleoside Phosphorothioates, Annu. Rev. Biochem. 54, 367-402]. Synthesis of phosphorothioate oligonucleotides is known in the art [Matsukura, M., Zon, G., Shinozuka, K., Stein, C. A., Mitsuya, H., Cohen, J. S. and Broder, S. (1988). Synthesis of Phosphorothioate Analogs of Oligodeoxyribonucleotides and Their Antiviral Activity Against Human Immunodeficiency Virus, Gene 72, 343-347].

Although the oligonucleotide adaptors are provided in great excess during the ligation step, there is still a low probability that two restriction fragments will ligate to one another, rather than to the adaptor. To prevent this, the ligation products can again be treated with the restriction endonuclease used to generate the fragments, in order to cleave the formed interfragment dimers. The endonuclease will not cleave the ligated adaptors if they are synthesized from modified precursors (such as nucleotides containing N⁶-methyl-deoxyadenosine), which are known and currently commercially available [e.g., from Pharmacia LKB]. Resistance of the ligated adaptors to digestion by the restriction endonuclease can be increased further if the ligated oligonucleotide is synthesized from phosphorothioates, and if phosphorothioate analogs of the nucleoside triphosphates are used as substrates for extension of the 3′ termini of the fragments, instead of utilizing natural nucleoside triphosphates as substrates [Eckstein, F. and Gish, G. (1989). Phosphorothioates in Molecular Biology, Trends Biol. Sci. 14, 97-100].

It is not necessary that all these steps (digestion, ligation, extension, repetitive digestion) be performed separately. The necessary enzymes and substrates can be added into the same reaction mixture, without interference from one another. Moreover, the presence of the appropriate restriction endonuclease in the ligation mixture can be advantageous, because undesirable interfragment links will be destroyed as soon as they are formed.

After the priming regions have been added, the complementary strands are melted apart, such as by increasing temperature and/or by introducing denaturing agents such as guanidine isothiocyanate, urea, or formamide. The resulting strands are then hybridized to a binary sorting array, such as by following a standard protocol for the hybridization of DNA to immobilized oligonucleotides [Gingeras, T. R., Kwoh, D. Y. and Davis, G. R. (1987). Hybridization Properties of Immobilized Nucleic Acids, Nucleic Acids Res. 15, 5373-5390; Saiki, R. K., Walsh, P. S., Levenson, C. H. and Erlich, H. A. (1989). Genetic Analysis of Amplified DNA with Immobilized Sequence-specific Oligonucleotide Probes, Proc. Natl. Acad. Sci. U.S.A. 86, 6230-6234]. Hybridization is performed so that formation of only perfectly matched hybrids is promoted. The hybrids have a length which is equal to that of the immobilized oligonucleotides. The binary array contains immobilized oligonucleotides that are attached to the array at their 5′ termini and contain constant restriction site segments adjacent to a variable segment of predetermined length. Each strand will be bound to the array at its 3′ terminus. Its location within the array will be determined by the identity of the oligonucleotide segment that is located in the strand immediately upstream from the restored restriction site at its 3′ end, and that is complementary to the variable segment of the immobilized oligonucleotide to which the strand is bound. After hybridization and washing away all unbound material, the entire array is incubated with a DNA polymerase, such as Taq DNA polymerase or the DNA polymerase of bacteriophage T7, and deoxyribonucleotide 5′ triphosphate substrates. As a result, the 3′ end of each immobilized oligonucleotide to which a strand is bound will be extended to produce a complementary copy of the bound strand. The array is then vigorously washed under conventional conditions that remove the hybridized DNA strands and all other material that is not covalently bound to the surface. The wells in the array are then filled with a solution containing universal primer, an appropriate DNA polymerase, and the substrates and buffer needed to carry out a polymerase chain reaction. Preferably, the DNA polymerase is a highly processive and thermostable DNA polymerase with a high-temperature optimum, which can be used under conditions in which the secondary structure of single-stranded DNA is destabilized; for example, some variants of Taq DNA polymerase [Erlich et al. (1991)]. The array is then sealed, isolating the wells from each other, and exponential amplification is carried out, preferably simultaneously, in each well of the array. After amplification, the DNA in each well may be withdrawn for subsequent analysis.
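The rule that fixes a strand's location in the array can be expressed as a short Python sketch. It models only perfect-match hybridization on idealized sequences; the priming-region sequence used in the demonstration and the function names are hypothetical and are not taken from the specification.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def sorting_address(strand, priming_region, var_len):
    """Return the variable segment (5' to 3') of the immobilized
    oligonucleotide that would bind this strand, or None if the strand
    does not end in the priming region or is too short.
    The strand is written 5' to 3' and must carry the priming region
    (restored restriction site plus adaptor) at its 3' terminus."""
    if not strand.endswith(priming_region):
        return None
    end = len(strand) - len(priming_region)
    if end < var_len:
        return None
    upstream = strand[end - var_len:end]  # segment just upstream of the site
    return revcomp(upstream)

if __name__ == "__main__":
    priming = "GAATTCAGGC"  # hypothetical priming region, for illustration
    strand = "CCATGACG" + priming
    address = sorting_address(strand, priming, 4)
    # Full immobilized oligonucleotide: constant segment, then variable segment.
    print(address, revcomp(priming) + address)
```

In this picture the immobilized oligonucleotide is the reverse complement of the strand's 3′-terminal region, so its constant segment lies upstream of its variable segment and its 3′ end is free for extension.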

1.1.2. Method in which a Priming Region is Introduced by Fragment Ligation to Single-Stranded Synthetic Oligoribonucleotide Adaptors—

After digestion of DNA with a restriction endonuclease, the 5′ termini of the resulting fragments (which are phosphorylated) are ligated to a single-stranded 3′,5′-hydroxyl oligoribonucleotide adaptor with an RNA ligase, such as the RNA ligase of bacteriophage T4, in order to restore the restriction recognition sequence and introduce a priming region [Higgins, N. P., Geballe, A. P. and Cozzarelli, N. R. (1979). Addition of Oligonucleotides to the 5′-Terminus of DNA by T4 RNA Ligase, Nucleic Acids Res. 6, 1013-1024]. Synthesis of oligoribonucleotides is known in the art [Sampson, J. R. and Uhlenbeck, O. C. (1988). Biochemical and Physical Characterization of an Unmodified Yeast Phenylalanine Transfer RNA Transcribed in vitro, Proc. Natl. Acad. Sci. U.S.A. 85, 1033-1037; Chou, S. H., Flynn, P. and Reid, B. (1989). Solid-phase Synthesis and High-resolution NMR Studies of Two Synthetic Double-helical RNA Dodecamers: r(CGCGAAUUCGCG) and r(CGCGUAUACGCG), Biochemistry 28, 2422-2435]. To make the oligoribonucleotides of the adaptor more stable, they can be synthesized from α-phosphorothioate ribonucleotide precursors [Milligan, J. F. and Uhlenbeck, O. C. (1989). Determination of RNA-Protein Contacts Using Thiophosphate Substitutions, Biochemistry 28, 2849-2855].

After ligation, a reverse transcriptase is used to extend the 3′ ends of the fragments, utilizing the ligated oligoribonucleotide as a template, essentially as described by Sambrook et al. (1989). Use of an enzyme that lacks ribonuclease H activity is preferable [Kotewicz, M. L., Sampson, C. M., D'Alessio, J. M. and Gerard, G. F. (1988). Isolation of Cloned Moloney Murine Leukemia Virus Reverse Transcriptase Lacking Ribonuclease H Activity, Nucleic Acids Res. 16, 265-277]. As in Example 1.1.1, above, all reactions can be performed in one reaction mixture, in which case no re-digestion to eliminate dimers is necessary, because RNA ligase cannot ligate double-stranded DNA fragments.

The extended strands are then melted apart, hybridized to a sorting array and amplified there, as described in Example 1.1.1, above. For the extension of the immobilized oligonucleotide, however, reverse transcriptase is used instead of DNA polymerase, because reverse transcriptase can use both DNA and RNA as a template [Verma, I. M. (1981). Reverse Transcriptase, in The Enzymes, 3rd edition (P. D. Boyer, ed.), vol. 14, pp. 87-103, Academic Press, New York].

1.1.3. Method in which a Priming Region is Introduced by Fragment Tailing with a Homopolynucleotide Sequence—

This method can be used where DNA is digested with a restriction endonuclease whose recognition site can be restored by the addition of only one type of nucleotide. For example, DNA can be digested with restriction endonuclease Aha III or Dra I, whose recognition site is TTTAAA. Cleavage occurs in the middle of the site, between T and A residues, leaving (5′)p-AAA . . . and . . . TTT-OH(3′) fragment termini. The restriction site is restored by extension of the 3′ ends with poly(dA) through incubation with terminal deoxynucleotidyl transferase, essentially as described by Sambrook et al. (1989), in the presence of only one type of substrate, dATP. This produces . . . TTTAAAAAAAAAAAAAAAAAA-AAA . . . (3′), which serves as a priming region for the binding of a primer of the (5′)oligo(T)AAA-OH(3′) type. The 5′ termini of the fragments are then extended by ligation to non-phosphorylated oligo(dT) that is hybridized to the 3′ terminal extension. Detailed protocols for the addition of homopolymeric tails and for oligonucleotide ligation to DNA fragments are given in Sambrook et al. (1989). After melting the extended strands apart, they are hybridized to a binary sorting array whose immobilized oligonucleotides' constant segment consists of (5′)oligo(T)AAA(3′). All other operations are carried out as described in Example 1.1.1, above.
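The restoration of the TTTAAA site by dA tailing can be followed on a toy sequence with the Python sketch below. It tracks only one strand, written 5′ to 3′, and uses a fixed tail length, whereas terminal deoxynucleotidyl transferase produces tails of variable length; the function names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def cleave_blunt(seq, site="TTTAAA", cut=3):
    """Cut one strand at every occurrence of the site, cut nucleotides
    into the site (the middle of TTTAAA, between T and A)."""
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut])
        start = pos + cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

def tail_3prime(fragment, n, base="A"):
    """Append a homopolymeric tail of n nucleotides to the 3' end."""
    return fragment + base * n

if __name__ == "__main__":
    left, right = cleave_blunt("GGCCTTTAAAGGCC")
    tailed = tail_3prime(left, 12)
    primer = "T" * 9 + "AAA"  # primer of the (5')oligo(T)AAA(3') type
    print(left, right, tailed, revcomp(primer) in tailed)
```

The tailed fragment again contains TTTAAA at the junction, and the reverse complement of the oligo(T)AAA primer is found within its 3′-terminal region, which is what allows that primer to bind there.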

1.2. Sorting Restriction Fragments According to their Terminal Sequences, with 3′ and 5′ Terminal Priming Regions Being Introduced, One Before and One After Strand Sorting—

This procedure consumes larger amounts of enzymes and substrates than the procedure described in Example 1.1. However, only those strands that are correctly bound to the immobilized oligonucleotides acquire both priming regions necessary for PCR. Therefore, the possibility that non-specifically bound strands will be amplified is minimized. Furthermore, using this procedure different priming regions can be introduced at different termini of a strand. It then becomes possible to: (1) perform “asymmetric” PCR, where only one of the complementary strands is accumulated in significant amounts, and remains in a single-stranded form; (2) introduce a transcriptional promoter into only one of the priming regions, in order to be able to obtain RNA transcripts of only one strand (without also producing its complement as in conventional PCR); (3) differentially label complementary strands; and (4) avoid self-annealing of the strand's terminal segments, which can interfere with primer hybridization and thereby lower PCR efficiency.

1.2.1. Method in which a Priming Region is Introduced at a Restriction Fragment Strand's 5′ End by Ligation to a Double-Stranded Oligodeoxyribonucleotide Adaptor Before Sorting, and Another Priming Region is Introduced at the 3′ End After Sorting—

In this example, digestion of DNA, adaptor ligation and re-digestion of fragments are carried out as described in Example 1.1.1, above. The 3′ ends of the restriction fragments, however, are not extended by incubation with DNA polymerase. Instead, the strands ligated at their 5′ ends to adaptors are melted apart from their unextended complements and the strands are hybridized to a binary sorting array. The binary sorting array contains immobilized oligonucleotides that are pre-hybridized with shorter complementary 5′-phosphorylated oligonucleotides that cover (mask) the immobilized oligonucleotides except for a segment which includes a variable region and a region complementary to the portion of the restriction site remaining at the fragments' (unrestored) 3′ end. The masked region includes the rest of the restriction site and any other constant sequence, such as may be included in a priming region. Hybridization is carried out under conditions that promote the formation of only perfectly matched hybrids which are the length of the unmasked segment of the immobilized oligonucleotide. After washing away the unbound strands, the strands that remain bound are ligated to the masking oligonucleotides by incubation with DNA ligase. The correctly bound strands thus acquire a priming region at their 3′ end, in addition to the priming region they already have at their 5′ end. The two priming regions preferably correspond to different primers. The array is then washed under appropriately stringent conditions to remove all nucleic acids except the immobilized oligonucleotides and the ligated strands hybridized to them. A protocol for amplification can then be followed as described in Example 1.1.1, above, starting with extension of the immobilized primer by DNA polymerase, except that two different primers, rather than one universal primer, are used for PCR.

Using this procedure, only those strands that have been successfully ligated after sorting will be exponentially amplified during PCR, while other strands, if some remain after washing, will not be amplified, because they are missing one of the two priming regions.

1.2.2. Method in which One Terminal Priming Region is Introduced at the 5′ End by Ligation to a Single-Stranded Oligoribonucleotide Adaptor Before Sorting Restriction Fragment Strands, and Another Priming Region is Introduced at the 3′ End After Sorting—

A priming region is generated at the 5′ end of strands by fragment ligation to single-stranded oligoribonucleotides, as described in Example 1.1.2, above. The 3′ ends are not extended. Then the strands are melted apart and hybridized to a binary sorting array as described in Example 1.2.1, above. Following ligation of the strands to the masking oligonucleotides and subsequent washing, the immobilized oligonucleotides are extended, and the covalently bound strands' copies are amplified, as described in Example 1.1.2, above.

1.2.3. Method in which a 3′ Priming Region is Generated Before Strand Sorting, and a 5′ Priming Region is Generated After Strand Sorting, Both Extensions Being Generated by Tailing the Strands with a Homopolynucleotide Sequence—

As in Example 1.1.3, above, the procedure in this example can be used where DNA is digested with a restriction endonuclease whose recognition site can be restored by the addition of only one type of nucleotide. In this method, the 3′ termini of the DNA fragment strands are extended by incubation with terminal deoxynucleotidyl transferase. Unlike the method described in Example 1.1.3, however, ligation of a homooligonucleotide to the 5′ termini is omitted. Instead, the strands are melted apart and hybridized to a binary sorting array, and the immobilized primer is extended as described in Example 1.1.3. After synthesis of the bound strand's complementary copy, all the material that is not covalently linked to the surface of the array is washed away, and the 3′ end of the copy strand, which corresponds to the 5′ end of the original strand, is extended by incubation with terminal deoxynucleotidyl transferase, as described above. PCR is carried out utilizing a primer that consists of a 5′-terminal homooligonucleotide region that is complementary to the strand's homopolymeric tail, and a 3′-terminal region that is complementary to the part of the restriction site which has been restored by addition of the tail. A potential drawback to this method is that the strands acquire self-complementary terminal sequences. This method has the advantage, however, that only covalently bound strand copies receive the second priming region required for exponential amplification by PCR.

1.3. Sorting Restriction Fragments According to their Terminal Sequences, with Priming Regions at Both 3′ and 5′ Termini Being Introduced after Strand Sorting—

The procedure of this example provides the highest selectivity and the lowest background, because both the first and the second priming regions are generated only if a strand has been specifically bound to the immobilized oligonucleotides.

Unextended restriction fragments are melted into their constituent strands which are then hybridized to a binary sectioned array having immobilized oligonucleotides that are masked over their constant region, except for the portion of the constant region complementary to the partial restriction site remaining in the strands. The masking oligonucleotides are 5′-phosphorylated. The hybridized strands are then ligated to the masking oligonucleotides to generate the first priming region. Then, the immobilized oligonucleotides are extended, as described in Example 1.2.1, above. After additional (more vigorous) washing, in a manner that destroys all hybrids that have not been extended, the second priming region is generated in one of the following ways.

1.3.1. Method in which a Second Terminal Priming Region is Generated by Ligation of the 5′ End of the Bound Strand to a Single-Stranded Oligoribonucleotide Adaptor—

This procedure is performed utilizing T4 RNA ligase, essentially as described for oligoribonucleotide ligation before sorting (see Example 1.1.2, above). Then, the immobilized copy of the bound strand is further extended by incubation with reverse transcriptase, utilizing the oligoribonucleotide extension as a template. The material that is not covalently bound is then washed away, and the strands that are covalently bound are amplified by PCR, as described in Example 1.2.1, above.

1.3.2. Method in which a Second Terminal Priming Region is Generated by Extension of the Immobilized Copy with a Homopolymeric Tail—

A homopolymer tail is added by extending the immobilized strand copy using the procedure described in Example 1.2.3, above. In this case, however, two different primers result, because the first priming region in the immobilized oligonucleotide can be of any sequence. As in Examples 1.1.3 and 1.2.3, above, this method is applicable where the DNA is digested with a restriction endonuclease whose recognition site can be restored by the addition of only one type of nucleotide.

1.4. Sorting of DNA Fragments that are not Bounded by Restriction Recognition Sequences, According to their Terminal Sequences—

Such fragments can be obtained by DNA digestion with restriction endonucleases whose recognition sequences are remote from their cleavage sites, or by a method that does not involve restriction endonucleases, such as by known enzymatic methods [e.g., Pei, D., Corey, D. R. and Schultz, P. G. (1990). Site-specific Cleavage of Duplex DNA by a Semisynthetic Nuclease via Triple-helix Formation, Proc. Natl. Acad. Sci. U.S.A. 87, 9858-9862; Zuckermann, R. N. and Schultz, P. G. (1989). Site-selective Cleavage of Structured RNA by a Staphylococcal Nuclease-DNA Hybrid, Proc. Natl. Acad. Sci. U.S.A. 86, 1766-1770], or by known chemical methods [e.g., Chen, C. H. and Sigman, D. S. (1986). Nuclease Activity of 1,10-Phenanthroline-copper: Sequence-specific Targeting, Proc. Natl. Acad. Sci. U.S.A. 83, 7147-7151; Fedorova, O. S., Savitski, A. P., Shoikhet, K. G. and Ponomarev, G. V. (1990). Palladium(II)-coproporphyrin I as a Photoactivable Group in Sequence-specific Modification of Nucleic Acids by Oligonucleotide Derivatives, FEBS Lett. 259, 335-337]. Mixtures of relatively short DNA molecules can also be obtained by other known methods (e.g., cDNAs). In this method, the priming regions added to the fragments' termini, as well as the constant segments of the immobilized oligonucleotides, will not generally include a restriction recognition sequence. Specificity of hybridization and of priming at the fragments' termini is achieved by the addition of adaptors, utilizing the methods described in Examples 1.1 to 1.3, above, which provide unique priming regions. The uniqueness of these priming regions can be checked in preliminary hybridization experiments.

The use of terminal deoxynucleotidyl transferase for the introduction of homopolymeric extensions has restricted applicability when the fragment termini do not possess restriction recognition sequences, because homopolymeric sequences frequently occur in genomes, and therefore hybridization and PCR priming would not always be confined to the fragment termini.

Some methods of DNA cleavage result in DNA “nicking”, rather than in cleavage of the double-stranded fragments. Where nicking results from the fragmentation process, ligation of double-stranded adaptors is not preferable. Also, chemical cleavage of DNA often results in the appearance of 5′-hydroxyl and 3′-phosphoryl groups, i.e., the opposite of what is required for enzymatic ligation or extension. But preliminary dephosphorylation of 3′ termini (with a phosphatase, such as bacterial alkaline phosphatase or calf intestine alkaline phosphatase) and then (if necessary) phosphorylation of 5′ termini (with a kinase, such as the polynucleotide kinase of bacteriophage T4) can be carried out, as described by Sambrook et al. (1989). Alternatively, T4 polynucleotide kinase can be used for both phosphorylation of 5′ ends and dephosphorylation of 3′ ends of DNA, since this enzyme possesses both of those activities [Cobianchi, F. and Wilson, S. H. (1987). Enzymes for Modifying and Labeling DNA and RNA, Methods Enzymol. 152, 94-110].

1.5. Isolation of Individual DNAs or DNA Fragments, by Sorting According to their Terminal Sequences—

If the number of different DNA strands in a sample is rather small relative to the number of areas in a sorting array, there is a high probability that, after one round of sorting on a sectioned binary sorting array, many wells in the sectioned array will either be unoccupied, or occupied by only one type of fragment. In the case of a complex mixture of DNA strands, such as a mixture of strands obtained upon restriction endonuclease digestion of an entire human genome, a number of different types of fragments will occupy each well of a sorting array having, e.g., 65,536 sections. In that case, the isolation of individual fragments is achieved by sorting each group of fragments from the first round of sorting on a second sectioned binary array. As a result of PCR amplification (following generation of priming regions as described above), each well (section) in the first array will contain both the original strands that were hybridized by their 3′ termini and complementary copies of the original strands. The complementary strands will have 3′ termini that are complementary to the 5′-terminal sequences of the original strands and will therefore be different from the 3′ termini of the original strands. Thus, the complementary strands will bind to oligonucleotides in different wells within the new sectioned binary array, and, with a high probability, each strand will occupy a separate well, where it can then be amplified. As a result of this second round of sorting, almost all fragments will be separated from one another. In a diploid genome, however, virtually identical allelic fragments will, as a rule, accompany each other.
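The occupancy argument above can be illustrated numerically. The following is a minimal sketch only, assuming (this is not stated in the text) that each strand type falls into a uniformly random, independent well, so that well occupancy is approximately Poisson-distributed. The function names and the figure of 1,000,000 strand types for a complex digest are hypothetical.

```python
import math

def poisson_occupancy(n_strand_types, n_wells):
    """Fractions of wells that are empty, singly occupied, or multiply occupied.

    ASSUMPTION (not from the text): each strand type lands in a uniformly
    random, independent well, so occupancy is approximately Poisson.
    """
    lam = n_strand_types / n_wells          # mean strand types per well
    empty = math.exp(-lam)
    single = lam * math.exp(-lam)
    return {"mean": lam, "empty": empty, "single": single,
            "multiple": 1.0 - empty - single}

def fraction_alone(n_strand_types, n_wells):
    """Probability that a given strand type shares its well with no other."""
    return (1.0 - 1.0 / n_wells) ** (n_strand_types - 1)

WELLS = 4 ** 8  # 65,536 sections for variable octamers

complex_mix = poisson_occupancy(1_000_000, WELLS)  # hypothetical complex digest
print(f"1,000 types: {fraction_alone(1_000, WELLS):.3f} of strands sorted alone")
print(f"1,000,000 types: about {complex_mix['mean']:.1f} types per well")

# Second round: the handful of types from one first-round well are re-sorted
# on a fresh array of the same size.
second_round = fraction_alone(round(complex_mix["mean"]), WELLS)
print(f"second round: {second_round:.4f} of strands sorted alone")
```

Under these assumptions a simple mixture is almost fully resolved in one round, whereas a complex mixture leaves many types per well after the first round and is almost fully resolved only after the second.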

No matter which method of adding primers is utilized (see Examples 1.1 to 1.3, above), after the first round of sorting, each strand will already possess priming regions at both ends. Therefore each group of such strands can be directly hybridized to a second binary sectioned array having immobilized oligonucleotides thereon with a constant sequence complementary to the complementary strands' 3′-terminal priming region. (The complementary strands' 3′-terminal priming region could have been made different from the 3′-terminal priming region of the original strands.) The complementary strands can therefore be amplified by using the same primers as were used in the first round of sorting. This procedure is analogous to that described above for sorting strands following the generation of priming regions (see Example 1.1, above).

Alternatively, in order to ensure a higher degree of selectivity, the second sorting can be performed concomitant with the substitution of new priming regions for the original priming regions. For example, if restriction sites were included in the priming region (and eliminated from the fragments' internal regions), the old adaptors can be cleaved off with a restriction endonuclease, thus regenerating the original restriction fragments, and new adaptors can be introduced, using procedures described in Examples 1.2 or 1.3, above.

There are a number of ways to use the second binary sorting arrays economically. First, smaller arrays having shorter variable sequences than in the first array can be employed. Second, because the number of wells in a second sorting array will usually be much greater than the number of different strands in a well from the first array, one array can be used for the simultaneous sorting of strands from many of the wells in the first array. To prevent strands from different groups interfering with one another's isolation, their 3′-terminal sequences (“signature oligonucleotides”) can be surveyed prior to the second sorting by using a binary survey array, as described in Example 5.1.4, below. The oligonucleotides of this latter array are comprised of a variable sequence and an adjacent constant sequence that is complementary to a part of the strands' 3′-terminal priming region (e.g., the terminal restriction site). After surveying the terminal sequences, groups of strands that would not interfere with one another's separation can be mixed together before the second round of sorting. Alternatively, the material from individual wells of the first array can be delivered to particular addresses in the second array that have been determined from the results of the surveys.
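The pooling of non-interfering groups can be sketched as a simple greedy grouping. This is an illustration under stated assumptions, not a procedure given in the text: each first-round well is represented by the set of signature oligonucleotides found for it in the survey, two wells are taken to interfere when they share any signature, and the function name and the example signatures are hypothetical.

```python
from typing import Dict, List, Set

def pool_wells(signatures: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedily pool first-round wells whose surveyed signatures do not collide.

    ASSUMPTION: two wells interfere if they share at least one signature
    oligonucleotide, i.e. one address in the second sorting array.
    """
    pools = []  # each entry: (addresses already used, wells in this pool)
    for well, addresses in signatures.items():
        for used, members in pools:
            if used.isdisjoint(addresses):
                used |= addresses        # reserve these addresses
                members.append(well)
                break
        else:
            # No existing pool is free of collisions: start a new one.
            pools.append((set(addresses), [well]))
    return [members for _, members in pools]

# Hypothetical survey results for three first-round wells.
survey = {
    "well_A": {"AAACGTCA", "GGTTACCA"},
    "well_B": {"CCCATGCA"},
    "well_C": {"GGTTACCA", "TTTTGACA"},   # shares a signature with well_A
}
print(pool_wells(survey))
```

A greedy pass does not guarantee the smallest possible number of pools; it only guarantees that no pool contains two wells with a shared signature.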

Depending on the specific aim of this separation procedure, the strands can be amplified by either standard (“symmetric”) PCR or “asymmetric” PCR [Gyllensten, U. B. and Erlich, H. A. (1988). Generation of Single-Stranded DNA by the Polymerase Chain Reaction and Its Application to Direct Sequencing of the HLA-DQA Locus, Proc. Natl. Acad. Sci. U.S.A. 85, 7652-7656; U.S. Pat. No. 5,066,584, incorporated herein by reference]. Furthermore, RNA copies of the strands can be produced in the wells by incubation with a DNA-dependent RNA polymerase, such as the RNA polymerases of T7, T3, or SP6 bacteriophages [Tabor, S. (1989). DNA-Dependent RNA Polymerases, in Current Protocols in Molecular Biology (Ausubel, F. M. et al., eds.), vol. 1, pp. 3.8.1-3.8.4, John Wiley and Sons, New York], if an RNA polymerase-specific promoter sequence is included in one of the two priming regions used for DNA amplification.

1.6. Sorting Selected Strands by their Terminal Sequences—

There may be applications where it is desired to isolate and analyze only some selected strands from a complex strand mixture. There are 50,000 to 100,000 genes in the human genome, which together account for several percent of the genomic DNA, and which would be of primary interest for clinical diagnostics. Thus, it may be desirable that only 100,000 or so fragments be isolated and analyzed, instead of millions of restriction fragments from the patient's entire genome. Instead of preparing an array that includes all possible variable oligonucleotide segments of a certain length, a binary sorting array can be prepared that contains selected oligonucleotides whose variable segments are chosen so that they match the termini of every fragment of interest, and only the termini of those fragments, i.e., the segments are long enough to isolate only the fragments of interest. Once the first human genome is sequenced and all the genes are identified, and consequently all the accessible restriction sites and their adjacent regions are known, it will be possible to predict which restriction enzyme(s) would produce fragments encompassing genes, and which oligonucleotide sequences at each strand's terminus will be unique. For example, most of the 15-meric variable segments would be unique for the human genome (together with an adjacent hexameric restriction site they would form effective recognition sequences that are 21 nucleotides long). At the same time, different oligonucleotides can be made of different lengths to ensure that each would hybridize to only one type of strand. (Means of ensuring highly selective hybridization conditions for every oligonucleotide in the array, irrespective of its length and nucleotide composition, are described above in Section II.) An array that contains 100,000 selected oligonucleotides with differing variable segments would be of virtually the same size as an array made of all possible variable octamers (65,536).
If, for example, a human genome is digested with one or more chosen restriction enzymes, and the strands are hybridized and amplified within such an array according to the methods described in Examples 1.1 to 1.3, above, every fragment of interest will occupy a particular well, and will be essentially homogeneous (except for minor sequence differences between allelic fragments, which will almost always possess identical terminal sequences and will therefore almost always occupy the same well). The fragments obtained (in either double-stranded or single-stranded form, depending on the type of amplification used, as described in Example 1.5, above) can then be analyzed directly. Because the DNA sequence is substantially similar for every individual, except for on average a few sequence differences per gene, it would be sufficient, for most genes, to merely survey the oligonucleotide content of the corresponding fragments (e.g., as described above in Section V), and to compare it to the genome sequences that have already been established, to identify the sequence differences. Alternatively, only some chosen fragments can be analyzed, and the array can then be stored as a comprehensive permanent bank of all of the patient's genes, for use in subsequent analyses, if desired.
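The arithmetic behind the array sizes and the uniqueness of a 21-nucleotide recognition sequence can be checked with a short calculation. It is a rough sketch that assumes a random-sequence model of the genome and a haploid genome size of 3 × 10⁹ nucleotides; both are assumptions introduced here, not figures taken from the text, and the function names are hypothetical.

```python
def n_variable_segments(length):
    """Number of distinct oligonucleotide sequences of a given length."""
    return 4 ** length

def expected_chance_matches(genome_nt, site_len, variable_len):
    """Expected chance occurrences of one (site + variable segment) sequence.

    ASSUMPTION: the genome is modelled as a random sequence with equal base
    frequencies, which real genomes only approximate.
    """
    return genome_nt / 4 ** (site_len + variable_len)

GENOME_NT = 3_000_000_000   # assumed haploid human genome size

print(n_variable_segments(8))                      # comprehensive octamer array
print(expected_chance_matches(GENOME_NT, 6, 8))    # hexameric site + octamer
print(expected_chance_matches(GENOME_NT, 6, 15))   # hexameric site + 15-mer
```

Under this model a hexameric site plus a variable octamer is expected to recur several times by chance, whereas a hexameric site plus a 15-mer is expected far less than once, which is consistent with the statement that most 15-meric variable segments would be unique.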

1.7. Sorting RNAs According to their Terminal Sequences—

1.7.1. Sorting of Eukaryotic mRNAs—

Mature eukaryotic mRNAs all share some structural features that can help in their manipulation using arrays. All of these RNAs have a “cap” structure (such as a 7-methylguanosine residue attached to the RNA by a 5′-5′ pyrophosphate bond) on their 5′ end, and most of the RNAs also possess a 3′-terminal poly(A) tail, which is attached posttranscriptionally by a poly(A) polymerase. Because there are usually no long oligo(A) tracts in the internal regions of cellular RNAs, the poly(A) tail can serve as a naturally occurring terminal priming sequence in the sorting procedure. The size of mRNAs (several thousand nucleotides in length) allows these sequences to be amplified and analyzed directly, without prior cleavage into smaller fragments.

There are known methods for preparing essentially undegraded total cellular RNA [Sambrook et al., 1989]. Residual amounts of degraded RNA can be removed by treatment with a specific 5′-3′ exoribonuclease that completely degrades uncapped RNAs while leaving capped RNAs intact [Murthy, K. G. K., Park, P. and Manley, J. L. (1991). A Nuclear Micrococcal-sensitive, ATP-dependent Exoribonuclease Degrades Uncapped but not Capped RNA Substrates, Nucleic Acids Res. 19, 2685-2692].

Total cellular RNA is converted into complementary DNA (cDNA) using an oligo(dT) primer and a reverse transcriptase (see Example 1.1.2, above) or Thermus thermophilus DNA polymerase (Tth DNA polymerase) [Myers, T. W. and Gelfand, D. H. (1991). Reverse Transcription and DNA Amplification by a Thermus thermophilus DNA Polymerase, Biochemistry 30, 7661-7666]. Then, omitting second strand synthesis, single-stranded cDNAs (which possess oligo(dT) extensions at their 5′ end and variable 3′ termini) are sorted according to their 3′-terminal oligonucleotide segments on a sectioned binary array and are ligated there (following the procedure described in Example 1.2.1, above) to pre-hybridized oligonucleotide adaptors of a predetermined sequence that are complementary to the immobilized oligonucleotides' constant sequence, and that introduce into a cDNA molecule the 3′-terminal priming site. The procedure described in Example 1.2.1 is followed to amplify the cDNA, using two primers for PCR amplification: oligo(dT) and an oligonucleotide that is complementary to the ligated adaptor.

Alternatively, cellular RNAs are directly hybridized according to their poly(A)-tailed 3′ termini to a sectioned binary array, whose immobilized oligonucleotides' constant sequence is comprised of oligo(dT). After washing away unbound RNAs, the immobilized primer is extended by incubation with a reverse transcriptase or Tth DNA polymerase, using the hybridized RNA as a template. The second priming site can be generated, and cDNA can then be amplified, by ligation of the 5′ termini of the RNA molecules to oligoribonucleotide adaptors before sorting (as described in Example 1.1.2, above) or after sorting (as described in Example 1.2.2, above), or by extension of the immobilized DNA copies by the addition of a 3′-terminal homopolymeric tail (as described in Example 1.2.3, above). If oligoribonucleotide ligation to the 5′ end of the RNA is used, the 5′-terminal cap structure should first be removed by incubation with an appropriate enzyme, e.g., nucleotide pyrophosphatase from tobacco or potato cells, which results in the generation of phosphorylated 5′ ends [Furuichi, Y. and Shatkin, A. J. (1989). Characterization of Cap Structures, Methods Enzymol. 180, 164-176]. To overcome potential interference with ligation resulting from the presence of RNA secondary structures, dimethylsulfoxide (up to 40% v/v) should be added to the ligation buffer to denature the RNA without appreciably decreasing the ligase activity (Romaniuk and Uhlenbeck, 1983).

The result of the above procedure is sorted and amplified cDNAs of all cellular mRNAs. A typical mammalian (including human) cell contains between 10,000 and 30,000 different mRNA sequences [Davidson, E. H. (1976). Gene Activity in Early Development, 2nd edition, Academic Press, New York]. For example, if an oligonucleotide array made of variable octamers is used (i.e., made of 65,536 different oligonucleotides), most of the different types of cDNAs will be obtained in an individual state. Again, as in other applications, the final amplified product can be synthesized as either a double-stranded or a single-stranded DNA or RNA (as described in Example 1.5, above).

One of the most significant problems in preparing comprehensive cDNA libraries is that the number of copies of different RNAs that occur in a cell can differ by several orders of magnitude [Williams, J. G. (1981). The Preparation and Screening of a cDNA Clone Bank, in Genetic Engineering (R. Williamson, ed.), vol. 1, p. 1, Academic Press, London]. Various rather complicated methods of enrichment of rare RNAs or their cDNAs in the sample are used to overcome this problem [Sambrook et al., 1989]. However, this problem does not arise if the above scheme is employed and the RNAs are sorted into different wells. Exponential amplification by PCR is allowed to continue until there is a leveling-off of the synthesis due to consumption of the substrates or primers. Then there will be a roughly equal amount of DNA product in each well, irrespective of the starting number of copies of a template. Put another way, PCR amplification using the invention results in an equalization in the number of cDNA copies of different cellular RNAs, no matter whether they are abundant or rare to begin with, avoiding the problem encountered with conventional cDNA library formation.
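The equalizing effect of amplification to a plateau can be illustrated with a deliberately idealized model. The sketch assumes perfect doubling in every cycle and a hard ceiling set by the primers and substrates available in a well; the ceiling of 10¹² copies, the cycle numbers, and the function name are hypothetical values chosen only for illustration.

```python
def pcr_copies(start_copies, cycles, plateau):
    """Copies after a number of cycles in an idealized, saturating PCR.

    ASSUMPTIONS: perfect doubling each cycle, and a hard plateau imposed by
    the finite supply of primers and substrates in the well.
    """
    copies = start_copies
    for _ in range(cycles):
        copies = min(2 * copies, plateau)
    return copies

PLATEAU = 10 ** 12   # hypothetical maximum yield per well

# A rare template (1 copy) and an abundant one (10,000 copies), each sorted
# into its own well and amplified separately.
early_ratio = pcr_copies(10_000, 20, PLATEAU) / pcr_copies(1, 20, PLATEAU)
rare_final = pcr_copies(1, 50, PLATEAU)
abundant_final = pcr_copies(10_000, 50, PLATEAU)

print("ratio after 20 cycles:", early_ratio)
print("final copies (rare, abundant):", rare_final, abundant_final)
```

In this model the abundant template keeps its full advantage while both reactions are still exponential, but both wells end at the same plateau once amplification is allowed to run to completion.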

1.7.2. Sorting RNAs Lacking a 3′-Terminal Poly(A) Tail—

In this case, the 3′-terminal poly(A) tail can first be introduced by using poly(A) polymerase [Sippel, A. E. (1973). Purification and Characterization of Adenosine Triphosphate: Ribonucleic Acid Adenyltransferase from Escherichia coli, Eur. J. Biochem. 37, 31-40], with subsequent steps essentially identical to those described in Example 1.7.1, above. If RNAs are sorted directly (i.e., without first synthesizing cDNA), the 5′-terminal priming regions are preferably introduced by ligating the RNA, by incubation with RNA ligase, to a non-phosphorylated oligoribonucleotide. For ligation to occur, the 5′ terminus of the RNA must be phosphorylated; if it is not, phosphorylation should be performed by using polynucleotide kinase (as described in Example 1.4, above). If the 5′ end is blocked by a triphosphate group (as in most prokaryotic RNAs), it should first be dephosphorylated by treatment with a phosphatase.

Alternatively, RNAs can be sorted according to their 3′-terminal oligonucleotide segments on a sectioned binary array and ligated there to pre-hybridized oligodeoxynucleotide adaptors, following the procedure described in Example 1.2.1, above. It has been shown that T4 DNA ligase efficiently joins the 3′-hydroxyl group of RNA to the 5′-phosphoryl group of DNA in mixed duplexes [Nath, K. and Hurwitz, J. (1974). Covalent Attachment of Polyribonucleotides to Polydeoxyribonucleotides Catalyzed by Deoxyribonucleic Acid Ligase, J. Biol. Chem. 249, 3680-3688; Selsing, E. and Wells, R. D. (1979). Polynucleotide Block Polymers Consisting of a DNA:RNA Hybrid Joined to a DNA:DNA Duplex. Synthesis and Characterization of dG_(n):rC₁dC₁ Duplexes, J. Biol. Chem. 254, 5410-5416].

2. Sorting Nucleic Acids or their Fragments by their Internal Sequences

2.1. Sorting DNA Strands by their Internal Sequences on a Binary Array, According to a Combination of a Variable Oligonucleotide Segment and an Adjacent Restriction Site—

This procedure can be used, for example, for sorting strands before surveying fragment signatures, to ascertain which sequenced fragments neighbor each other within a longer DNA. The purity of the sorted strands (i.e., freedom from contaminating irrelevant strands) is not as critical for this purpose as it is in sequencing. The only requirement is that the number of copies of each contaminating strand be low enough, compared with the number of copies of legitimately bound strands, that the hybridization signals that the legitimate strands produce in different areas of a signature survey array be reliably distinguishable from the signals produced by irrelevant strands.

2.1.1 Addition of Both Priming Regions at the Same Time—

After a DNA sample has been digested with a restriction endonuclease, terminal priming regions are added to both ends of the fragment strands by one of the methods described in Examples 1.1.1 to 1.1.3, above. Then the strands are melted apart and hybridized to a sectioned binary array whose immobilized oligonucleotides are comprised of a variable segment and a constant segment, the latter being complementary to a preselected restriction recognition sequence occurring in the DNA. If the procedure is performed to order previously sequenced restriction fragments, it is preferable that the constant segment be complementary to the recognition sequence of the restriction endonuclease used to produce the sequenced fragments. The oligonucleotides immobilized in the array can have either end free; however, free 3′ ends are preferable. In that case, after washing away the unbound strands, the immobilized oligonucleotides are preferably extended, using the bound strands as templates. The length of minimally extended hybrids of short strands will increase by the length of the sequence introduced at the fragment's 5′ end, resulting in an increase in the melting temperature of the extended hybrids. The array is then washed under much more stringent conditions in which the only bound strands that remain are those that are hybridized to extended immobilized oligonucleotides. The wells in the array are then filled with a solution containing universal primer, an appropriate DNA polymerase, and the substrates and buffer needed to carry out a polymerase chain reaction. The array is then sealed, isolating the wells from each other, and exponential amplification is carried out in each well of the array. If the oligonucleotides in the array are linked to the surface by their 3′ ends, the oligonucleotide extension step, as well as the second washing, is omitted.

2.1.2 Addition of Two Different Priming Regions in Separate Steps—

In this method, the priming regions on the 3′ and 5′ ends of the strands are generated in two steps: 5′ priming regions are introduced by ligation to either a double-stranded oligodeoxyribonucleotide adaptor (as described in Example 1.2.1, above) or a single-stranded oligoribonucleotide adaptor (as described in Example 1.2.2, above), whereas the 3′ priming regions are introduced by extending the strands' 3′ termini through the addition of a homopolymeric tail (as described in Example 1.2.3, above) after the 5′-terminally ligated strands have been melted apart and hybridized to a 3′ or 5′ binary array. (The order of 3′-terminal extension and 5′-terminal ligation to oligoribonucleotides can be reversed.) The rest of the procedure is identical to that described in Example 2.1.1, above, with immobilized oligonucleotide extension and second washing being preferably included when the oligonucleotides are linked to the surface by their 5′ ends.

2.2. Sorting Nucleic Acid Strands by their Internal Sequences on an Ordinary Array—

This method can be used, for example, for sorting nucleic acids into groups of strands that share some sequence motif, or for isolating individual strands that contain unique sequence segments. The array can be 5′ or 3′. The oligonucleotides need not contain a constant segment, and can, if desired, be rather long. The array can contain only selected oligonucleotides, whose sequence and length can be different from one another (rather than being a comprehensive array). Both DNAs and RNAs can be sorted, essentially following the procedures described in Example 2.1, above. In the case of RNA, a preferred scheme includes, first, addition of a poly(A) tail to the 3′ end (if it is not present there) by incubation with a poly(A) polymerase, and then, after hybridization of the strands and extension of the immobilized oligonucleotides by a reverse transcriptase, the ligation of the 5′ end of the RNAs to an oligoribonucleotide adaptor.

3. Preparing Partial Strands of Nucleic Acids on Oligonucleotide Arrays

There are two aspects to this procedure: first, the generation of partial strands, and second, the sorting of the partial strands into groups according to the identity of their terminal oligonucleotide segments. In one embodiment these two aims are achieved in a single step. Preparing partials has steps in common with strand sorting, described above; however, in strand sorting it is desirable to preserve the original strand intact, and to amplify precise copies of the original strand, whereas in preparing partials, truncated copies of the original strand are produced. All of the embodiments described below in this section are based on the following principle: in generating partials from a strand, one of the original strand ends is preserved (it will be referred to as the “fixed” end), whereas the other end is truncated to a different extent in the various partials (it will be referred to as the “variable” end). Although either the 5′ or the 3′ end of the original strand can serve as the fixed end, it is preferable that the 5′ end be fixed. If amplification of sorted partials is desirable, it is preferable that the 5′ end of the original strand, i.e., the fixed end, be provided with a priming region prior to strand partialing, by any of the methods described above, and that the partialing be carried out on a sectioned array. Either an individual strand or a mixture of strands can be subjected to a partialing procedure; however, if the mixture is very complex (such as a restriction digest of a large genome), it is desirable that the mixture first be sorted into less complex groups of strands, as described above. The groups of strands used for preparing partials should essentially be devoid of contaminating strands; therefore, sorting by terminal sequences is preferable for the preliminary sorting of strands. If preliminary sorting of strands is performed, the strands will already contain the terminal priming regions necessary for amplification of the partials.
As with sorting, partialing can be performed on either DNA or RNA, the final product being either DNA or RNA, in either a double-stranded or a single-stranded state.

3.1. Methods Employing Enzymatic Cleavage of DNA Fragments—

The purpose of the cleavage is to produce a set of partials of every possible length; therefore, DNA should be cleaved as randomly as possible, and to the extent that there is approximately one cut per strand. The extent of cleavage is determined by the enzyme concentration, temperature, and duration of incubation. Optimal reaction conditions can be determined in preliminary experiments for a given range of strand lengths.
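What "approximately one cut per strand" implies for the population of strands can be estimated with a simple statistical sketch. It assumes, as an idealization not stated in the text, that cuts occur independently and uniformly along the strand, so that the number of cuts per strand follows a Poisson distribution; the function name is hypothetical.

```python
import math

def cut_count_distribution(mean_cuts, max_cuts=3):
    """Probability of exactly 0, 1, ..., max_cuts cuts in a strand.

    ASSUMPTION: cuts are independent and uniformly placed, so the number of
    cuts per strand is Poisson-distributed with the given mean.
    """
    return [math.exp(-mean_cuts) * mean_cuts ** k / math.factorial(k)
            for k in range(max_cuts + 1)]

for mean in (0.5, 1.0, 2.0):
    probs = cut_count_distribution(mean)
    print(f"mean {mean}: uncut {probs[0]:.3f}, one cut {probs[1]:.3f}, "
          f"two or more {1 - probs[0] - probs[1]:.3f}")
```

Under this model the fraction of strands cut exactly once is greatest when the mean is one cut per strand, although a substantial fraction of strands still remains uncut or is cut more than once, which is one reason the conditions are best settled in preliminary experiments.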

3.1.1. Utilizing Double-Strand-Specific Deoxyribonucleases for Cleaving Double-Stranded DNA Fragments—

Deoxyribonuclease I from bovine pancreas (DNase I) cleaves both double-stranded and single-stranded DNA; however, double-stranded DNA is preferable as the starting material for preparing partials because of its essentially homogeneous secondary structure, so that every segment of a DNA molecule is equally accessible to cleavage. Double-stranded DNA fragments are produced as a result of “symmetric” PCR that can be carried out when sorting strands (as described in Example 1.2, above). An advantage of using DNase I is that it produces fragments with 5′-phosphoryl and 3′-hydroxyl termini, which are suitable for enzymatic ligation.

Cleavage of DNA by DNase I is not perfectly random under standard conditions; for example, DNase I cleaves phosphodiester bonds that are 5′ from a deoxythymidine more frequently than other bonds [Laskowski, M., Sr. (1971). Deoxyribonuclease I, in The Enzymes, 3rd edition (P. D. Boyer, ed.), vol. 4, pp. 289-311, Academic Press, New York]. The bias of DNase I for cleaving at certain nucleotides is largely eliminated if the reaction buffer contains Mn⁺⁺ instead of Mg⁺⁺ [Anderson, S. (1981). Shotgun DNA Sequencing Using Cloned DNase I-generated Fragments, Nucleic Acids Res. 9, 3015-3027]. Moreover, the preference of DNase I for particular nucleotides can be either increased or decreased in a predictable way by including transition metal ions, such as Cu⁺⁺ or Hg⁺⁺, in the incubation buffer [Clark, P. and Eichhorn, G. L. (1974). A Predictable Modification of Enzyme Specificity. Selective Alteration of DNA Bases by Metal Ions to Promote Cleavage Specificity by Deoxyribonuclease, Biochemistry 13, 5098-5102]. Thus, DNA cleavage by DNase I can be made essentially random by manipulating the content of different transition metal ions in the reaction medium.

Another way to make cleavage more random is to use mixtures of different deoxyribonucleases, whose spectra of nucleotide specificity complement one another. For example, the nucleotide specificity spectrum of neutral DNase from crab (Cancer pagurus) testes is essentially complementary to that of DNase I; moreover, this DNase also produces 5′-phosphoryl and 3′-hydroxyl termini [Bernardi, A., Gaillard, C. and Bernardi, G. (1975). The Specificity of Five DNases as Studied by the Analysis of 5′-Terminal Doublets, Eur. J. Biochem. 52, 451-457].

The exact composition of the reaction mixture should be found in preliminary experiments with a terminally labeled DNA. The cleavage should result in a “ladder” of bands of essentially equal intensity when seen after polyacrylamide gel electrophoresis under denaturing conditions (Sambrook et al., 1989).

After cleavage of the double-stranded DNA fragments, DNase is removed, e.g., by phenol extraction (Sambrook et al., 1989). The (partial) strands are then melted apart and are hybridized to a sectioned binary array, wherein the immobilized oligonucleotides are pre-hybridized with shorter complementary 5′-phosphorylated oligonucleotides of a constant sequence that cover (mask) the immobilized oligonucleotides except for a segment that consists of a variable sequence. Hybridization is carried out under conditions that favor the formation of perfectly matched hybrids of a length that is equal to the length of the unmasked (variable) segment of the immobilized oligonucleotide, and that minimize the formation of imperfectly matched hybrids. After washing away unbound strands, the strands that remain bound are ligated to the masking oligonucleotides by incubation with a DNA ligase. The ligated masking oligonucleotides will themselves serve as the second (3′-terminal) priming region of a partial strand. (All the partials of a strand will share the same 5′ priming sequence that had been introduced into the strand before generation of the partials.) If restriction fragments are to be partialed that possess some restriction site at their termini and do not possess this site internally, it is preferable that the 3′-terminal priming region added to the partials include that site. This increases the specificity of terminal priming during subsequent amplification of the partials by PCR. Subsequent extension, washing, and amplification steps are as described in Example 1.1.1, above, for sorted strands. If the partials are prepared for the purpose of sequence determination, asymmetric PCR can be performed. Asymmetric PCR results in only one of the complementary strands of each partial being accumulated in significant amounts.
Alternatively, an RNA polymerase promoter sequence can be included in one of the two primers, and amplified DNA is then transcribed to produce multiple single-stranded RNA copies of one of the two complementary partial strands (as described in Example 1.5, above).

As is the case for strand sorting, covalently bound (complementary) copies of each partial strand will be generated within the array, the copy of each type of partial being present at a known location; therefore, the array can be stored as a permanent record of all generated partials. It can be used repeatedly for the synthesis of additional copies of the partial strands.

If two different primers are used to amplify the full-length strands before the generation of partials (e.g., during a strand sorting procedure), then complementary strands will possess different priming sequences at their 5′ termini, which are preserved during strand partialing. Therefore, depending on the combination of primers used during partial strand amplification, the partials that originate from either of the complementary strands, or from both of them, will be amplified. For example, if the primer sequences that are present at the 5′ ends (fixed ends) of complementary strands prior to the generation of partials are “a” and “b”, and if after the generation of partials primer “c” is added to the truncated 3′ ends (variable ends), then the presence of primers a and c in the amplification reaction will result in the synthesis of one set of partials, while the presence of primers b and c will result in the synthesis of the other set of partials. Thus, after partials of one strand in each complementary pair have been amplified by utilizing an appropriate pair of primers, the samples are withdrawn, the array is washed, and then partials of the complementary strands can be amplified by employing a different pair of primers.
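The primer logic in the a, b, c example can be written out as a small lookup. This is purely an illustrative sketch of the bookkeeping, with hypothetical names; it assumes that exponential amplification of a set of partials requires both the primer for its fixed 5′ end (a or b) and the primer for the variable 3′ end (c).

```python
# Priming sequence at the fixed 5' end of each strand of a complementary
# pair (labels taken from the example in the text; set names hypothetical).
FIXED_END_PRIMER = {
    "a": "partials of the first strand",
    "b": "partials of the complementary strand",
}
VARIABLE_END_PRIMER = "c"

def amplified_partial_sets(primers_in_reaction):
    """Return which sets of partials are exponentially amplified.

    ASSUMPTION: a set of partials is amplified only if the reaction contains
    both its fixed-end primer (a or b) and the variable-end primer (c).
    """
    if VARIABLE_END_PRIMER not in primers_in_reaction:
        return []
    return [label for primer, label in FIXED_END_PRIMER.items()
            if primer in primers_in_reaction]

print(amplified_partial_sets({"a", "c"}))
print(amplified_partial_sets({"b", "c"}))
print(amplified_partial_sets({"a", "b", "c"}))
```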

3.1.2. Utilizing Single-Strand-Specific Endonucleases for Cleaving Single-Stranded DNA Fragments—

This method can be used for cleaving both single-stranded DNA and double-stranded DNA, after the latter is denatured (i.e., melted into its constituent complementary strands). The best choice for cleavage is, at present, nuclease S1 from Aspergillus oryzae, which cleaves single-stranded regions in both DNA and RNA, producing fragments with 5′-phosphoryl and 3′-hydroxyl termini. Cleavage is essentially non-specific with respect to nucleotide sequence. There may be, however, problems with the cleavage of double-stranded regions that occur as secondary structures in a single-stranded nucleic acid, because these double-stranded regions are resistant to attack by this nuclease. The solution for this problem lies in the stability of the nuclease at high temperatures (it remains active at temperatures as high as 65° C.), at low ionic strength, and at rather high concentrations of many denaturing agents (even 50% formamide is tolerable) [Shishido, K. and Ando, T. (1982). Single-strand-specific Nucleases, in Nucleases (S. M. Linn and R. J. Roberts, eds.), pp. 155-185, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.]. Under these conditions, secondary structure elements are either destroyed or significantly destabilized. The steps that follow DNA cleavage are essentially the same as described in Example 3.1.1, above, except that fragment melting is omitted.

3.1.3. Utilizing Exonucleases for Cleaving Partially Phosphorothioate-Substituted Nucleic Acid Strands—

An intrinsically random method of preparing partials, which is not dependent on the existence of nucleic acid secondary structures and which produces fragments whose termini are suitable for ligation, is carried out using α-phosphorothioate analogs of natural nucleotides. These nucleotide analogs are incorporated into DNA strands by DNA polymerase, and the phosphorothioate internucleotide linkages that are formed are fully resistant to cleavage by exonuclease III, a 3′-5′ exonuclease, so that exonucleolytic cleavage from the 3′ end of a strand stops immediately downstream of the first phosphorothioate bond [Putney, S. D., Benkovic, S. J. and Schimmel, P. R. (1981). A DNA Fragment with an α-Phosphorothioate Nucleotide at One End Is Asymmetrically Blocked from Digestion by Exonuclease III and Can Be Replicated in vivo, Proc. Natl. Acad. Sci. U.S.A. 78, 7350-7354]. Partials of every possible length are generated, as described by Labeit et al., except that all four phosphorothioate analogs are present in one reaction [Labeit, S., Lehrach, H. and Goody, R. S. (1986). A New Method of DNA Sequencing Using Deoxynucleoside α-Thiotriphosphates, DNA 5, 173-177]. The procedure described in Example 3.1.1, above, is then followed.
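How random phosphorothioate incorporation followed by 3′-5′ exonuclease digestion yields partials of many lengths can be illustrated with a toy simulation. It is a sketch under assumptions that are not taken from the text: every internucleotide linkage independently has the same probability of being a phosphorothioate, digestion always stops at the first protected linkage met from the 3′ end, and a strand with no protected linkage is treated as fully digested. The function name, the strand length, and the substitution probability are hypothetical.

```python
import random
from collections import Counter

def partial_length(n_linkages, p_thio, rng):
    """Length (in nucleotides) left after 3'->5' exonuclease digestion.

    Linkage i joins nucleotides i and i+1 (numbered from the 5' end).
    ASSUMPTIONS: each linkage is a phosphorothioate independently with
    probability p_thio; digestion stops immediately downstream of the first
    phosphorothioate linkage met from the 3' end; a strand with no protected
    linkage is modelled as fully digested (length 0).
    """
    for i in range(n_linkages, 0, -1):      # scan from the 3' end
        if rng.random() < p_thio:
            return i + 1                    # nucleotides 1..i+1 remain
    return 0

rng = random.Random(0)                      # fixed seed for repeatability
lengths = [partial_length(200, 0.01, rng) for _ in range(2000)]
histogram = Counter(lengths)
print("distinct partial lengths obtained:", len(histogram))
print("shortest and longest non-zero partials:",
      min(l for l in lengths if l), max(lengths))
```

The substitution level chosen for the copying reaction determines how evenly the partial lengths are represented, so in practice it would be adjusted in preliminary experiments rather than taken from a model of this kind.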

3.2. Methods Employing Chemical Degradation of DNA—

These methods are applicable to both double-stranded and single-stranded nucleic acids. Chemical degradation is, in most cases, essentially random, because it can be performed under conditions that destroy secondary structure, and because the small size of the modifying chemicals gives them ready access to the nucleotides that are involved in secondary structures.

3.2.1. Chemical Degradation of DNA Strands Containing Natural Nucleotides—

Both base-nonspecific reagents [Cartwright, I. L. and Elgin, S. C. R. (1982). Analysis of Chromatin Structure and DNA Sequence Organization: Use of the 1,10-Phenanthroline-cuprous Complex, Nucleic Acids Res. 10, 5835-5852; Cartwright, I. L., Hertzberg, R. P., Dervan, P. B. and Elgin, S. C. (1983). Cleavage of Chromatin with Methidiumpropyl-EDTA•iron(II), Proc. Natl. Acad. Sci. U.S.A. 80, 3213-3217; Kobayashi, S., Ueda, K., Morita, J., Sakai, H. and Komano, T. (1988). DNA Damage Induced by Ascorbate in the Presence of Cu²⁺, Biochim. Biophys. Acta 949, 143-147; Reed, C. J. and Douglas, R. T. (1991). Chemical Cleavage of Plasmid DNA by Glutathione in the Presence of Cu(II) Ions. The Cu(II)-thiol System for DNA Strand Scission, Biochem. J. 275, 601-608] and base-specific reagents [Maxam, A. M. and Gilbert, W. (1980). Sequencing End-labeled DNA with Base-specific Chemical Cleavages, Methods Enzymol. 65, 499-560] can be used. In the latter case, after base-specific cleavage is performed separately with several portions of the sample, the portions are mixed together to form a set of all possible partial DNA lengths.

The main drawback to chemical cleavage is that the location of the terminal phosphate groups on the fragments is opposite to what is required for enzymatic ligation: 5′-hydroxyl and 3′-phosphoryl groups are produced in most cases. To overcome this problem, enzymatic dephosphorylation of 3′ ends can be carried out (as described in Example 1.4, above). Alternatively, (complementary) partial copies that cover the distance between the 3′ termini of the original strands and the cleavage sites can be produced in a linear fashion by incubation with a DNA polymerase. In this case, primer(s) complementary to the 3′-terminal priming region(s) should be used. The product strands will then possess the 3′-terminal hydroxyl groups necessary for ligation to masking oligonucleotides in the array. Subsequent steps for obtaining sorted partials are then carried out (as described in Example 3.1.1, above).

3.2.2. Cleavage of DNA Strands whose Natural Nucleotides are Substituted with their α-Phosphorothioate Analogs—

This method is based on the technique developed by Gish and Eckstein for sequencing nucleic acids. In their approach, four different DNA (or RNA) polymerization reactions are carried out; in each reaction, one of the four natural nucleoside triphosphates is replaced with the corresponding α-thiotriphosphate nucleoside analog. The full-length product strands thus produced are treated with alkylating agents, such as 2-iodoethanol or 2,3-epoxy-1-propanol, producing phosphorothioate triesters that are more susceptible to hydrolysis than natural phosphodiester bonds. Hydrolysis mainly results in desulphurization, with regeneration of the natural phosphodiester bond, but some cleavage of the nucleic acid strand occurs. This cleavage occurs randomly along the strand, and does not depend on whether or not the corresponding region is involved in a secondary structure [Gish, G. and Eckstein, F. (1988). DNA and RNA Sequence Determination Based on Phosphorothioate Chemistry, Science 240, 1520-1522; Nakamaye, K. L., Gish, G., Eckstein, F. and Vosberg, H. P. (1988). Direct Sequencing of Polymerase Chain Reaction Amplified DNA Fragments through the Incorporation of Deoxynucleoside α-Thiotriphosphates, Nucleic Acids Res. 16, 9947-9959].

In order to generate all possible partials taking advantage of this approach, a DNA sample is pre-amplified in the presence of α-phosphorothioate substrates. This can be done during a strand sorting procedure as described above. In contradistinction to the original method of Gish and Eckstein, all four α-phosphorothioates are present together, in one reaction mixture. Subsequent treatment with iodoethanol results in random cleavage of the DNA strands. The desired extent of cleavage can be achieved both by appropriately controlling alkylation conditions, and by varying the proportion of natural substrates to their phosphorothioate analogs during DNA synthesis. Since cleavage results in a mixture of 3′-hydroxyl and 3′-phosphoryl termini (Gish and Eckstein, 1988), removal of 3′ phosphates with a phosphatase is preferably carried out (as described in Example 1.4, above) before the partials are sorted (as described in Example 3.1.1, above).

3.3. Method of Preparing Partials Directly on a Sectioned Array, without Prior Degradation of Nucleic Acids—

In this embodiment, the generation of partials and their sorting according to the identity of the sequences at their variable ends occur essentially in one step. First, a strand or a group of strands (if double-stranded nucleic acid is used as a starting material, the complementary strands are first melted apart) is directly hybridized to a sectioned ordinary array, whose oligonucleotides comprise only variable sequences of a pre-selected length and are immobilized on the surface of the array by their 5′ termini. Optimally, hybridization is carried out under conditions in which hybrids can only form whose length is equal to the length of the immobilized oligonucleotide. Each strand is able to bind to many different locations within the array, dependent on which oligonucleotide segments are present in its sequence. If the array is comprehensive, then a hybrid is formed somewhere within the array for every oligonucleotide that occurs in a DNA's sequence. After hybridization, the entire array is washed and incubated with an appropriate DNA polymerase in order to extend the immobilized oligonucleotide, using the hybridized strand as a template. Each product strand is a partial (complementary) copy of the hybridized strand. Each partial begins at the place in the strand's sequence where it has been bound to the immobilized oligonucleotide and ends at the priming region at the 5′ terminus of the strand. (If a priming region has not been introduced at the strand's 5′ end before partialing, it can be generated at this step, after the hybrids that have not been extended are eliminated by washing. This can be done either by ligating the 5′ end of the bound strand to a single-stranded oligoribonucleotide adaptor, as described in Example 1.3.1, above, or by tailing the immobilized partial copy with a homopolynucleotide, as described in Example 1.3.2, above.) The entire array is then vigorously washed under conditions that remove the original full-length strands and essentially all other material that is not covalently bound to the surface. Subsequent amplification of the immobilized partials can be carried out in different ways, dependent on whether it is desired to use linear amplification (which produces DNA or RNA copies of each partial), or exponential amplification (which is able to produce a much larger number of DNA copies).
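The one-step generation and sorting of partials described above can be sketched in a few lines. This is a simplified illustration that assumes perfect-match hybridization only, a comprehensive array of oligonucleotides of length `k`, and full extension of every hybrid to the template's 5′ terminus; the names `partials_on_array`, `revcomp`, and `wells` are hypothetical and are not part of the specification.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def partials_on_array(strand, k):
    """Map each well address (immobilized k-mer) to the partials made there.

    For every k-long segment of the strand, the complementary immobilized
    oligonucleotide binds it and is extended to the strand's 5' terminus,
    so the product is the reverse complement of strand[0 : j + k].
    """
    wells = defaultdict(list)
    for j in range(len(strand) - k + 1):
        address = revcomp(strand[j:j + k])   # immobilized oligonucleotide
        partial = revcomp(strand[:j + k])    # extended (complementary) copy
        wells[address].append(partial)
    return dict(wells)
```

Each partial in this sketch begins with the sequence of the immobilized oligonucleotide that primed it, and a segment that occurs more than once in the strand yields several partials of different length in the same well.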

3.3.1. Linear Copying of Partial Strands—

Linear copying results in only generating copies of partials of the parental strand and not complementary copies. This may be advantageous in analyzing the results of a subsequent survey of the oligonucleotide content of the partials. Linear copying takes advantage of the presence of the priming region on the 3′ end of the immobilized partials (that is complementary to the 5′-terminal priming region of the original strand). If DNA copies are desired, a thermostable DNA polymerase and a primer that is complementary to that priming region should be used. After the array is sealed to isolate individual wells from each other, temperature cycling is performed as in PCR. RNA copies can be produced by employing an RNA polymerase (as described in Example 1.5, above); in which case, the priming region should contain an appropriate promoter sequence. Linear amplification of partials in the form of RNA does not require temperature cycling and is more effective, since at least 700 full-length RNA copies can be produced from each DNA template with T7 RNA polymerase [Weitzmann, C. J., Cunningham, P. R. and Ofengand, J. (1990). Cloning, in vitro Transcription, and Biological Activity of Escherichia coli 23S Ribosomal RNA, Nucleic Acids Res. 18, 3515-3520]. An advantage of the linear copying of the partials prepared by the method of this embodiment is the absence of a priming region at the 3′ end of the copies produced that could otherwise interfere with certain uses of the partials discussed below.
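The difference in yield between linear and exponential copying can be made concrete with idealized arithmetic. The sketch below is a rough model only: the function names and the `copies_per_round` and `efficiency` parameters are hypothetical, and real yields depend on enzyme, template, and reaction conditions.

```python
def linear_copies(templates, rounds, copies_per_round=1):
    """Copies made when only the immobilized templates are ever copied."""
    return templates * rounds * copies_per_round


def exponential_copies(templates, cycles, efficiency=1.0):
    """Total strands after PCR in which products also serve as templates.

    efficiency = 1.0 is the idealized case of perfect doubling per cycle.
    """
    return templates * (1 + efficiency) ** cycles
```

Under these assumptions, 30 temperature cycles of linear DNA copying give 30 copies per template, a single transcription round at the cited 700 copies per template gives 700, and 30 cycles of ideal exponential amplification give 2^30 strands per template.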

3.3.2. Exponential Amplification of Partial Strands—

Exponential copying results in the generation of partials, and their complements. For a strand to be exponentially amplified by PCR, both of its termini should be provided with a priming region, preferably different priming regions. The immobilized (complementary) partial (obtained by extension of the immobilized oligonucleotide) contains only one (3′-terminal) priming region, and a complementary copy produced by linear copying would also have only one priming region (on its 5′ end). For RNA copies to have a priming region at their 5′ ends, the immobilized partial copy should have been provided with an RNA polymerase promoter downstream of its 3′ terminal priming region using the methods described herein. The second priming region that is needed for exponential amplification can be introduced at the 3′ ends of the complementary copies as follows.

(a) The 3′ termini of RNA copies can then be ligated to oligoribonucleotide or oligodeoxyribonucleotide adaptors which are phosphorylated at their 5′ end and whose 3′ end is blocked [Romaniuk, P. J. and Uhlenbeck, O. C. (1983). Joining of RNA Molecules with RNA Ligase, Methods Enzymol. 100, 52-59; Uhlenbeck, O. C. and Gumport, R. I. (1982). T4 RNA Ligase, in The Enzymes, 3rd edition (P. D. Boyer, ed.), vol. 15, pp. 31-58, Academic Press, New York]. Exponential PCR amplification can then be performed by utilizing the two primers that correspond to the two priming regions, and then incubating with Tth DNA polymerase (Myers and Gelfand, 1991).

(b) If the amplified copies are DNA molecules, they can be transferred, such as by blotting (after melting them free of the immobilized partial), onto a binary array that is a mirror copy of the first array in the arrangement of the variable segments of its immobilized oligonucleotides. The constant segments of this binary array are pre-hybridized to masking oligonucleotides whose ligation to the 3′ termini of the transferred DNAs (by DNA ligase, such as described in Example 1.2.1, above) results in generation of the second priming region. Exponential PCR amplification can then be performed by utilizing the two primers that correspond to the two priming regions, and an appropriate DNA polymerase.

In methods (a) and (b), both priming regions preferably contain, when applicable, the recognition sequence of the restriction endonuclease that was used to digest the genomic DNA before full-length strand sorting, and which had thus been substantially eliminated from the strands' internal regions.

(c) The priming region at the 3′ terminus of a DNA or RNA copy can be introduced by extension of the terminus with a homopolymeric tail by incubation with terminal deoxynucleotidyl transferase or poly(A) polymerase, respectively. The complementary homooligonucleotide can then be used during PCR to prime from this region. This method, however, may not be desirable, since similar homopolymeric stretches may occur somewhere within the partial, and the corresponding shortened sequence would then also be amplified.

(d) If partials are surveyed and it is desirable to detect only those oligonucleotides that occur in one complementary strand (such as detecting only parental oligonucleotides), then either only one of the two different primers should be labeled, or the primers should be labeled differently. It is also possible to use labeled substrates during asymmetric PCR.

3.4. Partialing RNAs—

A 3′-poly(A)-tailed RNA can be converted into a cDNA (such as described in Example 1.7.1, above), after which any method described above for partialing DNA can be applied. Alternatively, RNAs can be partialed directly. Prior to partialing, a 5′-terminal priming region should be introduced into RNAs (such as described in Example 1.7, above).

3.4.1. RNA Partialing by Enzymatic Degradation—

As with DNA, single-stranded RNA is cleaved by nuclease S1 randomly and in a sequence-nonspecific manner, but double-stranded secondary structure elements are essentially resistant to nuclease attack (see Example 3.1.2, above). Ribonuclease V1 from cobra venom, however, perfectly complements nuclease S1 by cleaving RNA mainly within double-stranded regions, and is also sequence-nonspecific [Vasilenko, S. K. and Ryte, V. C. (1975). Isolation of Highly Purified Ribonuclease from Cobra (Naja oxiana) Venom, Biokhimia (Moscow) 40, 578-583], so that by preparing mixtures of these enzymes an essentially uniform cleavage of an RNA strand can be obtained. 5′-phosphoryl and 3′-hydroxyl termini are produced upon action of either of these enzymes. If a double-stranded RNA is used as a starting material, it can be randomly cleaved by incubation with ribonuclease V1 alone.

A priming region can be introduced into the newly formed 3′ hydroxyl termini of RNA partial strands in solution, either by addition of a poly(A) tail by incubation with poly(A) polymerase, or by ligation of an oligonucleotide adaptor by incubation with RNA ligase in solution (such as described in Example 3.3.2, above). Then the partials are hybridized to a sectioned binary array of oligonucleotides whose constant segment is complementary to the 3′-terminal extension of the fragments. Alternatively, the 3′-terminal priming region can be introduced by ligation of RNA partials to a masking oligonucleotide on a sectioned binary array (such as described in Example 1.7.2, above). The immobilized oligonucleotides are then extended by incubation with reverse transcriptase or Tth DNA polymerase, the array is washed to remove non-covalently bound material, and the immobilized partials are amplified (such as by the methods described in Example 3.1.1, above).

3.4.2. RNA Partialing by Chemical Degradation—

Although there are many methods for chemical degradation of RNA, the simplest methods are alkaline hydrolysis [Donis-Keller, H., Maxam, A. M. and Gilbert, W. (1977). Mapping Adenines, Guanines, and Pyrimidines in RNA, Nucleic Acids Res. 4, 2527-2538] and RNA hydrolysis with Mg⁺⁺-formamide [Diamond, A. and Dudock, B. (1983). Methods of RNA Sequence Analysis, Methods Enzymol. 100, 431-453], which produce a fairly uniform ladder of different-length RNA bands when examined by electrophoresis through a denaturing polyacrylamide gel. As with DNA, chemical degradation results in fragments bearing 3′-phosphoryl groups that should be removed by incubation with a phosphatase (as described in Example 1.4, above), after which the procedure described in Example 3.4.1, above, is followed.

3.4.3. RNA Partialing Directly on an Oligonucleotide Array—

This is carried out as described for DNA (in Example 3.3, above), the difference being that a reverse transcriptase (or a DNA polymerase that can copy RNA, such as Tth DNA polymerase) is used for the extension of the immobilized oligonucleotides. Thus, DNA partials of the RNA strands are generated.

4. Uses of Sectioned Oligonucleotide Arrays for Manipulating Nucleic Acids

In the examples described below, it is assumed that the sequences of the nucleic acids to be manipulated have already been established either by the method of the invention, or by some other technique. Therefore, it is assumed that sequence analysis has preceded the manipulations described here. Since the sequence of the nucleic acid sample is already known, it is not necessary, in these manipulations, that the sample be distributed across the entire array. Instead, a sample can be delivered directly to the well in the array where a particular oligonucleotide (or a particular strand) is immobilized. Other wells in the array can be either left unused, in a particular procedure, or, preferably, used to carry out similar reactions in parallel. In these uses, the arrays can serve as a universal tool, enabling a very large number of specifically directed manipulations of nucleic acids to be carried out using a standard set of supplies, without recourse to synthesis of new oligonucleotides for each manipulation.

4.1. Isolation of Individual Partial Strands—

4.1.1. Separation of Partials that Share the Same Variable Terminal Oligonucleotide, but Originate from Different Strands—

Partials sharing the same terminal oligonucleotide, but that originate from different strands, possess, as a general rule, different sequences at their fixed ends (assuming that the fixed ends were not used for strand sorting). Therefore, individual partials almost always can be isolated from each other by sorting according to the terminal oligonucleotides at their fixed ends using arrays as described above.

4.1.2. Separation of Partials that Share the Same Variable Terminal Oligonucleotide and Originate from the Same Strand—

If an address oligonucleotide occurs in a strand more than once, there will be two or more partials of different length in the same well which possess not only identical 3′-terminal oligonucleotides (assuming the variable end is the 3′ end), but also identical 5′-terminal oligonucleotides. These partials can, of course, be separated by size, utilizing known gel-electrophoresis techniques (Sambrook et al., 1989). Even in this case, however, separation can be performed by using sectioned oligonucleotide arrays.

For example, there may be three identical oligonucleotides “P” in a strand, which are numbered, according to the order of their appearance in the parental strand in the 5′ to 3′ direction, P₁, P₂, and P₃. Accordingly, in the well where an oligonucleotide complementary to P is immobilized, three partials of different length are generated from the original strand, among which partial 1 is the shortest, and partial 3 is the longest. The method described below results in isolation of each of these three partials from one another.

Where the longest partial contains an oligonucleotide that does not occur in the shorter partials (i.e., an oligonucleotide that occurs between oligonucleotides P₂ and P₃, but does not occur upstream of P₂), its isolation is straightforward: the mixture is hybridized to a well containing the complementary oligonucleotide, wherein only the longest partial can bind.

For isolation of the shorter partials, a different (though similar) method is required, since any oligonucleotide that occurs in a shorter partial is also contained in a longer one. To prepare shorter partials, we first prepare a chosen partial from the parental strand, with a different variable terminus (i.e., not P). For example, to prepare partial 1 (the shortest partial), first a longer partial is prepared (using the technique described above) whose 3′-terminal oligonucleotide lies between oligonucleotides P₁ and P₂, but does not occur downstream of P₂. This is easily determined from an examination of the known sequence of the strand. Partialing this truncated strand, and isolating the partial whose terminal oligonucleotide is P, then yields partial 1 as the only partial with 3′-terminal oligonucleotide P. To prepare partial 2 (of intermediate size), a partial is first prepared whose 3′-terminal oligonucleotide lies between P₂ and P₃, and does not occur downstream of P₃. From this partial, two partials are then generated with 3′-terminal oligonucleotide P, of which partial 2 is the longest one, and can now be isolated as described for partial 3.

If oligonucleotide P occurs n times in a strand, a partial “i” can be isolated by first preparing a partial (or partials) in which oligonucleotide P_(i) is the P which is furthest downstream, i.e., a partial whose terminal oligonucleotide lies between P_(i) and P_(i+1) and does not occur downstream of P_(i+1). Once partial P_(i) is the longest partial in a mixture with shorter partials, it is isolated from the shorter partials by making use of an oligonucleotide that lies between P_(i−1) and P_(i), and does not occur upstream of P_(i−1), as described above.
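Because the strand's sequence is known, the choice of the two helper oligonucleotides can be planned by inspection of that sequence. The sketch below is one hypothetical way to do so, assuming the variable end is the 3′ end and that all oligonucleotides have the same length as P; the names `occurrences`, `plan_isolation`, `truncator`, and `selector` are illustrative only. The truncator is an oligonucleotide located between P_(i) and the next copy of P that has no copy further downstream, and the selector is an oligonucleotide present in partial i but absent from every shorter partial.

```python
def occurrences(strand, oligo):
    """0-based start of every (possibly overlapping) occurrence of oligo."""
    return [j for j in range(len(strand) - len(oligo) + 1)
            if strand.startswith(oligo, j)]


def plan_isolation(strand, p, i):
    """Plan the isolation of partial i (1-based), ending at the i-th copy of p.

    Returns (truncator, selector, partial). truncator is None when i is the
    last copy (or no suitable oligo exists); selector is None when i == 1
    (or no suitable oligo exists).
    """
    k = len(p)
    pos = occurrences(strand, p)
    if not 1 <= i <= len(pos):
        raise ValueError("p does not occur i times in the strand")

    truncator = None
    if i < len(pos):
        # Oligo starting after P_i and before P_(i+1), never found downstream.
        for j in range(pos[i - 1] + 1, pos[i]):
            q = strand[j:j + k]
            if all(o < pos[i] for o in occurrences(strand, q)):
                truncator = q
                break

    selector = None
    if i > 1:
        # Oligo inside partial i that no shorter partial contains.
        for j in range(pos[i - 2] + 1, pos[i - 1]):
            s = strand[j:j + k]
            if all(o > pos[i - 2] for o in occurrences(strand, s)):
                selector = s
                break

    return truncator, selector, strand[:pos[i - 1] + k]
```

For the illustrative strand `TTACGGCACGCTACGAT` with P = `ACG`, the plan for the intermediate partial is to truncate at `CGC`, re-partial at P, and then select with `CGG`.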

4.2. Preparation of Partial Strands that have Both Ends Truncated—

The methods described above in Examples 3.1 to 3.4 allow a nested set of all possible one-sided partials of a nucleic acid strand to be obtained. Desired one-sided partials can be prepared from either the direct or the complementary copies of a parental strand, or from a mixture of strands containing either the direct or complementary copies of each strand (for example a mixture of strands obtained by amplifying sorted strands in an asymmetric PCR to obtain either direct copies of the strands or their complementary copies). Partials can also be prepared from samples having both direct and complementary copies of parental strands present, such as a mixture of strands obtained by amplification of sorted strands in a symmetric PCR. Even using such a mixture, partials of the direct and complementary copies can be obtained separately. This can be carried out either on separate arrays, or on the same array. If one array is used for partialing both the direct and complementary copies of a parental strand, partials from either copy can be separately prepared by selectively amplifying partials of the direct copies or by selectively amplifying partials of the complementary copies at different times (using different combinations of primers as described in Example 3.1.1, above).

One-sided partials have one end fixed, and the other end variable, so that each partial corresponds to a parental strand having one end truncated to a different extent, i.e., a complete set of partials corresponds to the parental strand truncated at one end to all possible extents (see FIG. 9). Either end of the parental strand can be truncated. This can be done, for example, by randomly degrading a parental strand and sorting the partials obtained according to their 3′ termini on a 3′ binary array; or by sorting the partials according to their 5′ termini on a 5′ binary array. Alternatively, one-sided partials having either 3′ ends truncated or 5′ ends truncated, as desired, can be obtained by truncating either the direct copy or the complementary copy of a partial strand. For example, one-sided partials can be generated by truncation of either the direct or complementary copies at their 3′ ends using an appropriate method of Examples 3.1 or 3.2. Asymmetric PCR can then be employed to amplify only direct copies of the partials of direct copies of the parental strand; or to amplify only complementary copies of the partials of complementary copies of the parental strand. Since the 3′ end of a complementary strand corresponds to the 5′ end of a direct strand, the first set of amplified partials will comprise the direct strand truncated at its 3′ end, and the second set of amplified partials will comprise the direct strand truncated at its 5′ end. Of course, asymmetric PCR can also be used to amplify only complementary copies of the direct strand partials, and to amplify only direct copies of the complementary strand partials, which will comprise the complementary strand truncated at the 5′ end and the 3′ end, respectively. Thus, every possible one-sided partial, comprising either the direct copy or the complementary copy of a parental strand, that is truncated at either the 3′ end or the 5′ end, can be prepared by the methods of this invention.

The one-sided partials obtained can themselves be subjected to second partialing according to the invention, wherein the former fixed end is truncated to any extent using the techniques described above for preparing one-sided partials (see Examples 3.1 to 3.4, above), thereby resulting in two-sided partials. If comprehensive arrays are used in each of the two consecutive rounds of strand partialing, the two-sided partials obtained can be any segments desired of the original parental strands.

For example, to prepare a segment bordered in a strand by two internal oligonucleotides, “a” and “b”, one-sided partials of the strand can first be produced, resulting in both “a” and “b” being at the variable termini of partials. Assuming that oligonucleotide “a” lies in the parental strand between oligonucleotide “b” and the fixed end, the contents of the well in the array where the partials terminating with “b” have been sorted to (i.e., the well containing an immobilized oligonucleotide complementary to “b”) are withdrawn and partialed again on a second array. The second array is chosen to prepare partials having oligonucleotide “b” at the fixed end. The segment bordered by “a” and “b” will be found in this second array in the well where the partials terminating with “a” have been sorted to (i.e., in the well whose address is “a”). Of course, it is not necessary to prepare all one-sided partials of the original parental strand or all one-sided partials of the partial strand terminating with “b” using comprehensive arrays. Rather, provided that the relative location of “a” and “b” in the strand is known, only two wells with addresses “a” and “b” in the arrays are required. In the first round of the procedure, a sample containing the parental strand is partialed in well “b” (wherein partials terminating with “b” are generated), and the contents of this well are partialed in well “a” in the second round. Wells “a” and “b” may even belong to the same array. Furthermore, a single array can be used for simultaneously preparing (in a two-stage procedure) a large number of segments bordered by any chosen pairs of oligonucleotides in any of the strands that are present in a bank. Many variations of this technique will be apparent for obtaining the same results.
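The two-round procedure amounts to two successive truncations, which can be sketched on sequence strings. The sketch assumes the fixed end in the first round is the 5′ end, that “a” and “b” each occur exactly once in the strand, and it ignores priming regions; the helper names `find_unique`, `truncate_3prime`, `truncate_5prime`, and `excise` are hypothetical.

```python
def find_unique(strand, oligo):
    """Start index of oligo, which must occur exactly once in strand."""
    hits = [j for j in range(len(strand) - len(oligo) + 1)
            if strand.startswith(oligo, j)]
    if len(hits) != 1:
        raise ValueError(f"{oligo!r} occurs {len(hits)} times; expected one")
    return hits[0]


def truncate_3prime(strand, terminal):
    """One-sided partial with a fixed 5' end that terminates with `terminal`."""
    return strand[:find_unique(strand, terminal) + len(terminal)]


def truncate_5prime(strand, terminal):
    """One-sided partial with a fixed 3' end that begins with `terminal`."""
    return strand[find_unique(strand, terminal):]


def excise(strand, a, b):
    """Segment bordered by a (nearer the fixed 5' end) and b."""
    first_round = truncate_3prime(strand, b)   # round 1: well "b"
    return truncate_5prime(first_round, a)     # round 2: well "a"
```

With the illustrative strand `GGATCCTTAGCAAT`, `excise(strand, "ATC", "AGC")` returns the segment `ATCCTTAGC`. If either border oligonucleotide occurred more than once, several products would arise, and the sketch raises an error instead of choosing among them.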

Using this technique, any desired segment of a nucleic acid strand of a known or partially known sequence can be precisely “excised” and amplified (e.g., by the use of “cleavable primers” as is described below), irrespective of the presence or absence of restriction sites, and without the need for synthesizing specific oligonucleotide primers.

Of course, if the combination of two oligonucleotides that border the excised segment occurs more than once in the group of partials that are present in the same well of a first partialing array, then there will be several different products of such a double truncation. If this occurs, individual two-sided partials can be isolated by the method described above in Example 4.1.2, by sorting according to their internal sequences (see Examples 2.1 and 2.2, above), or by any other separation technique known in the art (e.g., by gel electrophoresis, as described by Sambrook et al., 1989).

4.3. Cleavable Primers—

Amplification of strands and partials following separation (or generation) on a sectioned oligonucleotide array requires that their ends be provided with priming regions, either one (for linear amplification) or two (for exponential amplification). These priming regions (generally terminal extensions), however, can be undesirable in the subsequent use of the amplified strands or partials, such as the making of recombinants or site-directed mutants (see Examples 4.4 and 4.5, below). Additionally, for some uses of the amplified strands or partials it is desirable to substitute new priming regions for old priming regions. For those uses, the primers used for amplification must first be removed from the 5′ ends of strands or partials. Where the junction of the primer and the strand (or partial) is contained within a unique restriction site, the primer can be removed by treating a double-stranded version of the strand (or partial) with a corresponding restriction endonuclease. However, restriction sites will often not be present at the junctions. A solution to this problem is to make the primer (or even only the junction nucleotide in the primer) chemically different from the rest of the strand (or partial). Below are several examples of such an approach. The primer in these examples resides at the strand's 5′ terminus.

4.3.1. Cleavage of Primers by Alkaline Hydrolysis or by Ribonuclease Digestion—

This method is suitable for removal of oligoribonucleotide primers, or mixed RNA/DNA primers whose 3′ terminal nucleotide (which becomes a junction nucleotide upon primer extension) is a ribonucleotide. Such primers are incorporated at the 5′ end of DNA strands or partials during the strands' or partials' amplification described elsewhere herein.

Alkaline hydrolysis cleaves a phosphodiester bond that is on the 3′ side of a ribonucleotide, and leaves intact a phosphodiester bond that is on the 3′ side of a deoxyribonucleotide [Wyatt, J. R. and Walker, G. T. (1989). Deoxynucleotide-containing Oligoribonucleotide Duplexes: Stability and Susceptibility to RNase V1 and RNase H, Nucleic Acids Res. 17, 7833-7842]. After alkaline hydrolysis, the pH of the reaction mixture is returned to a neutral value by the addition of acid, and the sample can be used without purification.
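The cleavage rule just stated (a break on the 3′ side of every ribonucleotide, none on the 3′ side of a deoxyribonucleotide) can be modeled on a chain of tagged nucleotides. The representation below, a list of `(base, sugar)` pairs with sugar `"r"` or `"d"`, and the function name `alkaline_hydrolysis` are assumptions made purely for illustration, and the model assumes hydrolysis goes to completion.

```python
def alkaline_hydrolysis(chain):
    """Split a chain after every ribonucleotide.

    chain is a list of (base, sugar) pairs written 5'->3', where sugar is
    "r" for a ribonucleotide and "d" for a deoxyribonucleotide. Returns the
    list of fragments in 5'->3' order.
    """
    fragments, current = [], []
    for index, (base, sugar) in enumerate(chain):
        current.append((base, sugar))
        is_last = index == len(chain) - 1
        if sugar == "r" and not is_last:
            # The bond on the 3' side of a ribonucleotide is cleaved.
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments
```

For a DNA primer whose only ribonucleotide is the junction nucleotide, the model returns two fragments, the intact primer and the primer-free strand. For an all-RNA primer it returns the primer as single nucleotides followed by the same primer-free strand.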

Primers containing a riboadenylate or a riboguanylate residue at their 3′ end can effectively be removed from a DNA strand or partial by treatment with T₂ ribonuclease [Scaringe, S. A., Francklyn, C. and Usman, N. (1990). Chemical Synthesis of Biologically Active Oligoribonucleotides Using β-Cyanoethyl Protected Ribonucleoside Phosphoramidites, Nucleic Acids Res. 18, 5433-5441]. After treatment, the sample is heated to 100° C. to inactivate the ribonuclease, and can be used without purification.

In both these cases, the released 5′ terminus of the strand (or partial) is left dephosphorylated. Therefore, if the strand obtained is subsequently used for ligation, it should be phosphorylated by incubation with polynucleotide kinase (as described in Example 1.4, above).

4.3.2. Cleavage of Primers from DNA Strands (or Partials) Synthesized from Phosphorothioate Nucleotide Precursors—

In this method, oligodeoxynucleotide or oligoribonucleotide primers are synthesized from natural nucleotides, but strand amplification is carried out in the presence of only α-phosphorothioate nucleotide precursors (as described in Example 3.2.2, above). Subsequent digestion of the synthesized strands with a 5′-3′ exonuclease, such as calf spleen 5′-3′ exonuclease, results in the elimination of all primer nucleotides except the original 3′-terminal (junction) nucleotide of the primer, with the released 5′-terminal group of a strand or partial being unphosphorylated [Spitzer, S. and Eckstein, F. (1988). Inhibition of Deoxyribonucleases by Phosphorothioate Groups in Oligodeoxynucleotides, Nucleic Acids Res. 16, 11691-11704]. The junction nucleotide is not removed, because it is joined to the rest of the strand by a phosphorothioate diester bond. Therefore, the strand obtained has an extra nucleotide at its 5′ end. This does not present a problem when the presence of the former junction nucleotide at the 5′ end of the strand is compatible with the subsequent use of the strand. The presence of the extra nucleotide can also be useful for site-directed mutagenesis (described in Example 4.5, below).

If the primer-deprived strand obtained by this method is to be used for ligation, the use of spleen exonuclease, which leaves 5′-hydroxyl groups, must then be followed by phosphorylation of the strand utilizing polynucleotide kinase. Therefore, where the strand is to be ligated, the use of bacteriophage lambda or bacteriophage T7 5′-3′ exonuclease is preferable over spleen exonuclease, since they leave 5′-phosphoryl groups at the site of cleavage [Sayers, J. R., Schmidt, W. and Eckstein, F. (1988). 5′-3′ Exonucleases in Phosphorothioate-based Oligonucleotide-directed Mutagenesis, Nucleic Acids Res. 16, 791-802].

4.3.3. Removal of Priming Regions from 3′ Ends—

After a primer is removed from the 5′ end of a strand, the strand can be used as a template for the synthesis of complementary copies, such as described for the linear amplification of partials (see Example 3.3.1, above). The complementary product strands will not contain a 3′-terminal priming region. If desired, the primer used for this copying can also be made cleavable using one of the methods described above in Examples 4.3.1 or 4.3.2. Since any strand or partial can be obtained in any of the complementary versions (described in Examples 1.5 and 3.1.1, above), it is possible to deprive any strand or partial of either its 5′ or 3′ priming region, or both of them.

4.4. Generation of Recombinant Nucleic Acids—

With the ability, using the invention, to excise, amplify, and isolate any segment of any strand of known or partially known sequence, and with the ability to introduce and to remove priming regions at the segment's termini (and, therefore, to substitute one priming region for another, if necessary), it is possible to prepare any desired recombinant nucleic acid by employing a standard nucleic acid ligation technique (Sambrook et al., 1989), and then to amplify the recombinant by PCR. Using sectioned arrays, thousands of recombinants can be prepared simultaneously, if desired. Also, in many cases, specific recombinations can be carried out on the arrays without prior purification of one or both of the nucleic acids to be ligated.

In the methods described below, two nucleic acid strands are ligated in one round of ligation. It is of course possible to repeat the ligation process, ligating the recombinant product to another strand, and to keep repeating the process any desired number of times to ligate the desired number of strands.

4.4.1. Use of Oligonucleotides Immobilized on an Array as Sequence-Specific “Splints” for the Ligation of Nucleic Acids—

In this example, a sectioned array contains immobilized oligonucleotides that consist of two portions, one being complementary to the 3′-terminal sequence of one of the moieties to be ligated, and the other being complementary to the 5′-terminal sequence of the other moiety to be ligated. The immobilized oligonucleotides can have either free 3′ or 5′ ends. The relevant termini of the nucleic acids to be ligated should be deprived of priming regions (as described in Example 4.3, above), but priming regions (preferably different) should be preserved at the opposite termini of the nucleic acids to allow amplification of the recombinants. After hybridization in an appropriate well of the array, the two nucleic acid strands are ligated to each other utilizing DNA ligase [Landegren, U., Kaiser, R., Sanders, J. and Hood, L. (1988). A Ligase-mediated Gene Detection Technique, Science 241, 1077-1080; Barany, F. (1991). Genetic Disease Detection and DNA Amplification Using Cloned Thermostable Ligase, Proc. Natl. Acad. Sci. U.S.A. 88, 189-193]. Unligated strands are then washed away. Only the ligated strands possess the two terminal priming regions that are required for subsequent PCR amplification. The strands that are to be ligated can be used in a mixture with other strands, provided that there are no other strands in the mixture with the same oligonucleotides at the termini that have been deprived of priming regions.

The sectioned array on which ligation is performed can have immobilized oligonucleotides with either their 5′ or 3′ termini free. It is usually impracticable to use a comprehensive array for this purpose; rather, a new array that includes only the particular immobilized oligonucleotides that are required is preferably prepared for the purpose of generating a particular set of recombinants. These immobilized oligonucleotides can be relatively long to ensure a high specificity of hybridization. All of the required oligonucleotides can be synthesized simultaneously on the array before the recombination procedure is carried out, utilizing, for example, a photolithographic technique (Fodor et al., 1991).

A specific application of this method is to ligate many different strands to one particular strand or partial, for example, in order to produce many recombinant variations of one gene. In that case, one portion of the splint, i.e., the immobilized oligonucleotide, is a constant segment, and the other portion of the splint is a variable segment, i.e., the array used is a binary array. The constant segment binds to the strand that was chosen to be included in every recombinant and the variable segment binds to the end of another strand or partial that is chosen to be fused with the invariant strand.
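The two-portion design of a splint can be sketched as follows. This is a minimal illustration that assumes perfect Watson-Crick pairing, DNA sequences written 5′ to 3′ as plain strings, and an equal number `k` of nucleotides bound on each side of the junction. The function names `design_splint` and `splint_aligns` are hypothetical.

```python
# Hypothetical model: DNA sequences are 5'->3' strings over A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_splint(upstream, downstream, k):
    """Return a splint complementary to the last k nucleotides of
    `upstream` (its 3' terminus) joined to the first k nucleotides of
    `downstream` (its 5' terminus)."""
    if k <= 0 or len(upstream) < k or len(downstream) < k:
        raise ValueError("both strands must be at least k nucleotides long")
    junction = upstream[-k:] + downstream[:k]
    return reverse_complement(junction)


def splint_aligns(splint, upstream, downstream, k):
    """True if the splint holds the two termini side by side for ligation."""
    return reverse_complement(splint) == upstream[-k:] + downstream[:k]


if __name__ == "__main__":
    splint = design_splint("ACGTAC", "TTGCAA", 3)
    print(splint, splint_aligns(splint, "ACGTAC", "TTGCAA", 3))
```

For the binary-array application, one half of the junction would be held constant across the array while the other half is varied from well to well.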

4.4.2. Method for Producing Recombinants in which One Nucleic Acid to be Combined is Ligated to the Free End of a Hybrid Formed Between Another Nucleic Acid to be Combined and the Immobilized Oligonucleotide—

In one embodiment of this method, a blunt-ended double-stranded nucleic acid fragment is ligated in an individual area of an array to a single-stranded DNA (or RNA) that is hybridized by its terminus to the complementary oligonucleotides immobilized in that area. The array is an ordinary array and can be either 3′ or 5′, depending on whether the single-stranded nucleic acid is to be hybridized to the immobilized oligonucleotide by its 5′ end or by its 3′ end, respectively. The hybrids are ligated to the double-stranded fragment by incubation with a DNA ligase [Sambrook et al., 1989], and this can only occur when the free terminus of the immobilized oligonucleotide and the complementary terminus of the hybridized strand are perfectly aligned to produce a blunt end.

The single-stranded nucleic acid to be ligated is selected according to the identity of its hybridized end. Therefore, it need not be separated from other strands, provided that all the other strands in the mixture have dissimilar terminal sequences. On the other hand, the double-stranded fragment to be ligated is not selected using our method, and therefore it must be isolated from other fragments if it is desired to obtain an individual recombinant nucleic acid. The 5′ termini to be ligated must be phosphorylated (this can apply to the immobilized oligonucleotide, if a 5′ array is used). To ensure the proper orientation of the double-stranded fragment, the end of the fragment opposite to that which is to be ligated should not be compatible with ligation to the immobilized oligonucleotide/single-stranded nucleic acid hybrid. Means for making double-stranded nucleic acid ends incompatible are well known. The non-ligating ends of both the double-stranded fragment and the single-stranded nucleic acid should preferably be provided with priming regions before their ligation to each other, so that the ligated strands can be exponentially amplified in a subsequent PCR. Preferably, the priming regions are different, so that the ligated strands are selectively amplified. (If the end of the fragment which is not intended to be ligated in fact incorrectly ligates, the resulting product will not be amplified during PCR because the two primers will hybridize to the same strand.)

The double-stranded fragment can be obtained, for example, by copying a strand that has had its 5′-terminal primer removed but retains a 3′-terminal priming region (using the techniques described in Example 4.3, above). The primer-deprived end is the ligating end, and should be phosphorylated before the strand is copied (if cleavage of the primer results in a 5′-hydroxyl group at this end). The primer used for copying the strand occurs at the non-ligating side of the fragment and should be non-phosphorylated to prevent ligation at that side. To prevent ligation of the 3′ end at this side of the fragment, the 3′-hydroxyl group of the strand to be copied can be blocked by a conventional chemical modification. Alternatively, this side of the double-stranded fragment can be made non-blunt by using, during strand copying, a primer whose 5′-terminal nucleotide is displaced in either direction with respect to the 3′-terminal nucleotide of the copied strand. In other words, the primer is chosen to hybridize to the strand so that either it protrudes, or the strand protrudes, resulting in a non-blunt end which is incompatible with the ligating end of the immobilized oligonucleotide/single-stranded nucleic acid hybrid. This approach can limit the amount of improper ligation.

Different pairs of single-stranded nucleic acids and double-stranded fragments can be ligated in each well of the array. Alternatively, if it is desired to have a collection of recombinants wherein only one moiety is varied and the other is the same, the double-stranded fragment can consist of a constant sequence and be ligated to variable single-stranded nucleic acids in wells of an array. In this method, as opposed to the method of Example 4.4.1, an array of specially designed oligonucleotides need not be prepared to produce a particular set of recombinants. Rather, one can use an ordinary comprehensive array of relatively short immobilized oligonucleotides.

In another embodiment of this method, a purified double-stranded blunt-ended fragment is ligated to the 3′ ends of oligoribonucleotides immobilized in a well of a 3′ ordinary array by incubation with T4 RNA ligase (Higgins et al., 1979). After unligated material is washed away, a single-stranded nucleic acid, either isolated or in a mixture with other strands with different terminal sequences, is hybridized to the immobilized partially double-stranded complex and then ligated by its phosphorylated 5′ end to the 3′ end of the double-stranded fragment by incubation with DNA ligase.

4.5. Site-Directed Mutagenesis—

The ability to prepare any partial of a strand according to the invention provides the opportunity to make nucleotide substitutions, deletions and insertions at any chosen position within a nucleic acid. Moreover, the use of sectioned arrays makes it possible to perform site-directed mutagenesis at a number of positions (even at all positions) at once, and in a particular embodiment, to determine, within individual wells of the array, properties of the encoded mutant proteins.

According to the methods described below for site-directed mutagenesis, mutations are introduced into a nucleic acid strand by first preparing partials having variable ends that correspond to the segment of the strand to be mutated, that segment preceding the location of the intended mutation. Then mutagenic nucleotides or oligonucleotides are introduced into the partials at their variable ends. The mutated partials are then extended to the length of the full-sized strand using the complementary copy of the original non-mutated strand as a template.

Of course, more than one site-directed mutation can be introduced into a strand in one procedure. For example, it may be desired to introduce mutations into a strand at positions “a”, “b”, and “c” (in the order those positions appear in the strand). A partial can first be prepared having on its variable 3′ end an oligonucleotide segment that just precedes position “a” in the parental strand. Then a sequence containing a mutation at position “a” can be introduced into the variable end of the partial (i.e., the 3′ end). The resulting first mutated partial is extended using as a template a longer partial that is complementary to the partial that ends (in the parental strand) just in front of position “b”. Then a sequence containing mutation “b” is introduced into the extended terminus of the partial that contains the mutation at position “a”. The resulting double mutated partial is extended on a template that is complementary to the partial that ends just in front of position “c”. The process is repeated with mutation “c”, using for the last desired extension a template that encodes the remaining portion of the strand to be mutated (for example, this can be a complement of the full-sized strand).
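The alternation of extension and mutation steps described above can be sketched at the sequence level. This is a simplified model, not the laboratory procedure: strands are 5′-to-3′ strings, each mutation is a single-nucleotide substitution, template-directed extension is represented by copying from the parental sequence, and positions are 0-based. The function name `mutate_sequentially` is hypothetical.

```python
def mutate_sequentially(parent, substitutions):
    """Build a multiply mutated strand one site at a time.

    `substitutions` maps a 0-based position in `parent` to the
    substituting nucleotide. Sites are processed in the order they
    occur in the strand, as for positions "a", "b" and "c" in the text.
    """
    product = ""
    for position in sorted(substitutions):
        if not 0 <= position < len(parent):
            raise ValueError("position outside the parental strand")
        # Extension: the growing partial is extended (on a template
        # complementary to the parent) up to just before this site.
        product += parent[len(product):position]
        # Mutation: the mutagenic nucleotide is added to the variable
        # (3') end of the partial.
        product += substitutions[position]
    # Final extension on a template encoding the rest of the strand.
    product += parent[len(product):]
    return product


if __name__ == "__main__":
    print(mutate_sequentially("ACGTACGTAC", {2: "T", 5: "A", 8: "G"}))
```

The product has the same length as the parent because each step in this sketch replaces one nucleotide. Insertions and deletions would change the bookkeeping of positions and are not modeled here.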

For mutagenesis, partials that have identical variable termini, but that originated from different parental strands, need not be separated from one another. However, if a particular oligonucleotide segment occurs more than once in a strand to be mutated, the corresponding partials must be separated from one another before mutagenesis, as described in Example 4.1.2, above.

4.5.1. Mutagenesis Involving Ligation of Partial Strands to Immobilized Oligonucleotides—

In this method, complements of nucleic acid partials (i.e., strands whose 5′ termini are variable and 3′ termini are fixed) are used. Their 5′-terminal priming regions are removed by complete alkaline digestion or by ribonuclease digestion of their incorporated cleavable primers (see Example 4.3.1, above). The resulting 5′ termini are phosphorylated by incubation with polynucleotide kinase, and the partials are then ligated by incubation with RNA ligase to the free 3′ hydroxyls of oligoribonucleotides that are immobilized on the surface of a 3′ sectioned ordinary array. The sequence of the immobilized oligonucleotide to which a partial is ligated is identical to the oligonucleotide segment that occurs in the original (full-length) strand immediately adjacent to the end of the partial, except for one (or a few) nucleotide difference(s) that corresponds to mutation(s) to be introduced.

The nucleotide differences are preferably located at the 3′ terminus of the immobilized oligonucleotide, and can correspond to a nucleotide substitution, insertion, or deletion. A deletion can be of any size. For a large insertion, the ligated partial, or the immobilized oligonucleotide, can first be fused to a nucleic acid containing all or part of the sequence to be inserted, using the method described in Example 4.4, above.

After washing away the material that is not covalently bound to the array, the immobilized strand is linearly copied, taking advantage of the priming region at its (fixed) 3′ end. The copies obtained correspond to partials that have been extended by the oligonucleotides containing the mutation(s). These copies are then annealed to their complementary full-length strands, and their 3′ termini extended by incubation with DNA polymerase, using the annealed complementary parental strand as a template. Finally, the extended mutant strands are amplified by PCR. It is important that the pair of primers utilized for amplification of a partial used for mutagenesis be different from the primers used to amplify the original (non-mutant) full-length strand. This ensures that only mutant strands are amplified.

If the aim of this procedure of the invention is protein engineering, then the amplified mutant strands can be transcribed and translated. Transcription and translation can be carried out either on the same array, or on a replica array. An RNA polymerase promoter should be included in advance in one of the primer regions of the mutant strand. For translation, the components of a cell-free translation system should be added to the reaction mixture in each well [Anderson, C. W., Straus, J. W. and Dudock, B. S. (1983). Preparation of a Cell-free Protein-synthesizing System from Wheat Germ, Methods Enzymol. 101, 635-644; Bujard, H., Gentz, R., Lanzer, M., Stueber, D., Mueller, M., Ibrahimi, I., Haeuptle, M.-T. and Dobberstein, B. (1987). A T5 Promoter-based Transcription-translation System for the Analysis of Proteins in vitro and in vivo, Methods Enzymol. 155, 416-433; Tymms, M. J. and McInnes, B. (1988). Efficient in vitro Expression of Interferon α Analogs Using SP6 Polymerase and Rabbit Reticulocyte Lysate, Gene Anal. Tech. 5, 9-15; Baranov, V. I., Morozov, I. Yu., Ortlepp, S. A. and Spirin, A. S. (1989). Gene Expression in a Cell-free System on the Preparative Scale, Gene 84, 463-466; Ueda, T., Tohda, H., Chikazumi, N., Eckstein, F. and Watanabe, K. (1991). Phosphorothioate-containing RNAs Show mRNA Activity in the Prokaryotic Translation Systems in vitro, Nucleic Acids Res. 19, 547-552; Lesley, S. A., Brow, M. A. and Burgess, R. R. (1991). Use of in vitro Protein Synthesis from Polymerase Chain Reaction-generated Templates to Study Interaction of Escherichia coli Transcription Factors with Core RNA Polymerase and for Epitope Mapping of Monoclonal Antibodies, J. Biol. Chem. 266, 2632-2638]. The translation products in each well can then be assayed as desired. For example, the proteins can be assayed in situ for activity (if they are enzymes), or they can be assayed for the presence of particular antigenic determinants (for example, by determining the ability of each protein to bind to an array of immobilized antibodies).

4.5.2. Nucleotide Substitution by the Addition of a Nucleotide to a Partial's End—

If the purpose of mutagenesis is to substitute a single nucleotide, a simpler method than that described in Example 4.5.1, above, can be employed. The method described below involves the addition of a single mutagenic nucleotide to the variable end of a partial.

In one embodiment of this method, a primer that is made of natural nucleotides and that is present on the variable end of a partial strand that was synthesized from phosphorothioate precursors is removed, as described above in Example 4.3.2, resulting in the appearance of an extra nucleotide at the partial's 5′ end. By employing during amplification one of the four possible primers that differ in their 3′-terminal nucleotide, one can add any desired nucleotide to the partial's 5′ variable end. The mutated partials are then copied by incubation with DNA polymerase, and the extra nucleotide appears at the 3′ end of the copy. The copy is then annealed to a complementary full-length strand, and its 3′ terminus is extended by incubation with DNA polymerase, using the full-length strand as a template. The extended mutant strand is then amplified by PCR using a pair of primers whose sequences are identical to the 5′-terminal priming regions of the annealed mutated partial and the template.

Although the mutant partial's 3′-terminal nucleotide does not match its counterpart in the original full-length strand, conditions are employed whereby such unpaired termini are extended by DNA polymerase [Wu, D. Y., Ugozzoli, L., Pal, B. K. and Wallace, R. B. (1989). Allele-specific Enzymatic Amplification of Beta-globin Genomic DNA for Diagnosis of Sickle Cell Anemia, Proc. Natl. Acad. Sci. U.S.A. 86, 2757-2760]. The low efficiency of such extension is compensated for by subsequent exponential amplification of the extended mutant strands.

There is a chance that the unpaired nucleotide, which is loosely bound to the template, will be looped out during extension, resulting in a nucleotide insertion rather than in a nucleotide substitution. To prevent the affected strands from being amplified, the heteroduplexes that consist of the mutant strand and the original strand can be treated, prior to PCR amplification, with a single-strand-specific endonuclease, such as nuclease S1, that cleaves DNA at single-nucleotide bulges, but leaves intact single-base mismatches [Bhattacharyya, A. and Lilley, D. M. (1989). The Contrasting Structures of Mismatched DNA Sequences Containing Looped-Out Bases (Bulges) and Multiple Mismatches (Bubbles), Nucleic Acids Res. 17, 6821-6840].

An alternative approach is to generate a full-length mutant strand directly from the modified partial (with an extra nucleotide at its 5′ end) without preparing a complementary copy of the modified partial. After the modified partial is annealed to a complementary full-length strand, the protruding single-stranded part of the duplex is filled in by utilizing the Klenow fragment of DNA polymerase I (which will not displace the annealed modified partial) and a primer that is complementary to the 3′-terminal priming region of the full-length strand. Then, the extended primer and the annealed modified partial are ligated to each other by incubation with DNA ligase. The resulting full-length mutant strand is then amplified by PCR.

5. Surveying Oligonucleotides with Binary Arrays

Surveying oligonucleotide content can be carried out in a conventional manner in the different embodiments of the invention, by hybridization of detectable nucleic acid strands or partials to an ordinary oligonucleotide array, followed by detection of the hybrids formed. However, with this approach the signal-to-noise ratio is not high enough to always avoid ambiguous results. The most significant problem in this respect is the inability to sufficiently discriminate against mismatched basepairs that occur at the ends of hybrids. That inability hampers the analysis of complex sequences [Drmanac, R., Strezoska, Z., Labat, I., Drmanac, S. and Crkvenjakov, R. (1990). Reliable Hybridization of Oligonucleotides as Short as Six Nucleotides, DNA Cell Biol. 9, 527-534]. The use of binary arrays in the manner discussed below helps to overcome this problem.

In some cases binary arrays are also useful for surveying longer oligonucleotides than are easily surveyed on an ordinary array (e.g., signature oligonucleotides) without increasing the size of the array over that of an ordinary array.

The oligonucleotides immobilized in a binary array that is used for surveys can have either free 5′ or 3′ ends, and the constant segment can be located either upstream or downstream from the variable segment. In most cases, it is preferable that the 3′ ends of immobilized oligonucleotides be free, and that their constant segments be located upstream of the variable segments.

Surveying can be carried out by utilizing sectioned arrays. However, the use of plain arrays (i.e., not sectioned) is preferable because these arrays are less expensive and more amenable to miniaturization. The following methods are based on the use of plain binary arrays and involve fragmentation of the strands or partials prior to surveying.

5.1. Surveying DNA Strands—

5.1.1. Comprehensive Surveys of DNA Strands—

In this format, every oligonucleotide segment that is present in a strand or in a partial, or in a group of strands or partials, is surveyed. If a survey of partials is performed in order to establish nucleotide sequences, it is preferable that each partial that is analyzed be represented by copies of the same sense. Thus, there should be only one of the complementary strands in a sample, or the complementary strands should be differentiable, e.g., one strand should produce either no detectable signal or a weaker signal. This can be accomplished by amplifying the partials linearly or by generating a great excess of one of the complementary strands over the other strand through the use of asymmetric PCR (see Example 3.1.1, above).

DNA strands (or partials) to be surveyed are preferably digested with nuclease S1 under conditions that destabilize DNA secondary structure (see Example 3.1.2, above). The digestion conditions are chosen so that the DNA pieces produced are as short as possible, but at the same time, most are at least one nucleotide longer than the variable segment of the oligonucleotides immobilized on the binary array. If the surveyed strands or partials have been previously sorted and amplified on a sectioned array, this degradation procedure can be performed simultaneously in each well of that array. Alternatively, if it is desired to store that array as a master array for later use, the array can be replicated by blotting onto another sectioned array (see Section I, above). The DNA is then amplified within the replica array by (asymmetric) PCR prior to digestion with nuclease S1.

After digestion, the nuclease is inactivated by, for example, heating to 100° C., and the DNA pieces are then hybridized to a binary array in which the constant segments of the immobilized oligonucleotides are pre-hybridized to 5′-phosphorylated complementary masking oligonucleotides. Preferably, the constant segment contains a restriction site that has been eliminated from the internal regions of the DNA strands prior to strand sorting (such as described in Example 1.1, above), and is long enough so that its hybrid with the masking oligonucleotide is preserved during subsequent procedures. The binding of the masking oligonucleotide can be stabilized by introduction of an intercalating group at its 3′ end [Asseline, U., Delarue, M., Lancelot, G., Toulmé, F., Thuong, N. T., Montenay-Garestier, T. and Helene, C. (1984). Nucleic Acid-binding Molecules with High Affinity and Base Sequence Specificity: Intercalating Agents Covalently Linked to Oligodeoxynucleotides, Proc. Natl. Acad. Sci. U.S.A. 81, 3297-3301; Gottikh, M. B., Ivanovskaia, M. G., Skripkin, E. A. and Shabarova, Z. A. (1990). Design of New Oligonucleotide Derivatives Resistant to Cell Nucleases, Bioorg. Khim. (Moscow) 16, 514-523].

The array is then incubated with a DNA ligase (for example, as in Example 1.1.1, above), resulting in ligation of the masking oligonucleotides to only those hybridized DNA strands or partials whose 3′-terminal nucleotide is immediately adjacent to the 5′ end of the masking oligonucleotide, and matches its counterpart in the immobilized oligonucleotide. DNA ligase is especially sensitive to mismatches at the junction site [Landegren, U., Kaiser, R., Sanders, J. and Hood, L. (1988). A Ligase-mediated Gene Detection Technique, Science 241, 1077-1080].

After all non-ligated DNA pieces have been washed away under much more stringent conditions than were used during hybridization, the immobilized oligonucleotides are extended by incubation with a DNA polymerase, preferably by only one nucleotide, using the protruding part of the ligated DNA piece as a template, and preferably using the chain-terminating 2′,3′-dideoxynucleotides as substrates instead of the conventional 2′-deoxynucleotides. This extension is only possible if the 3′-terminal base of the immobilized oligonucleotide forms a perfect basepair with its counterpart in the hybridized DNA piece. The use of the dideoxynucleotides ensures that all hybrids are extended by exactly one nucleotide, so that all extended hybrids are of the same length. The array is then washed under conditions that are sufficiently stringent to remove unextended hybrids.

Thus, at each of the terminal positions of the hybrids (where the nucleotides are most prone to form mismatches), there must be a perfectly matched basepair for a hybrid to survive washing and be detected.
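The two terminal checks can be sketched for the preferred geometry, in which the immobilized oligonucleotide has a free 3′ end and its constant segment lies upstream of the variable segment. This is a minimal model under stated assumptions: sequences are 5′-to-3′ strings, only perfectly matched hybrids are considered, and the masking oligonucleotide is taken to be already in place on the constant segment. The function name `survey_piece` is hypothetical.

```python
# Hypothetical model: DNA sequences are 5'->3' strings over A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def survey_piece(piece, variable_segment):
    """Return the dideoxynucleotide added to the immobilized
    oligonucleotide, or None if the piece would not be detected.

    Ligation check: the 3' end of the piece must pair with the first
    nucleotide of the variable segment, so that it abuts the 5' end of
    the masking oligonucleotide. In this perfect-match model the piece
    must therefore end with the reverse complement of the variable
    segment.
    Extension check: the piece must protrude by at least one nucleotide
    beyond the 3' end of the immobilized oligonucleotide to serve as a
    template.
    """
    target = reverse_complement(variable_segment)
    if len(piece) <= len(target):
        return None  # no protruding template, so no extension
    if not piece.endswith(target):
        return None  # 3' end not adjacent to the masking oligonucleotide
    template_base = piece[-len(target) - 1]
    return template_base.translate(COMPLEMENT)


if __name__ == "__main__":
    print(survey_piece("TTCGCTT", "AAGC"))
    print(survey_piece("GCTT", "AAGC"))
```

A piece that hybridizes to the variable segment but extends past it on the 3′ side, or that carries no 5′ overhang, fails one of the two checks and is not detected in this sketch.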

Internal mismatches will also occur (at a lower frequency). Those mismatches can be essentially eliminated by “proofreading” the hybrids that are formed. This can be done by the chemical and enzymatic means described below. These methods are also applicable when surveying is carried out by utilizing ordinary (i.e., non-binary) arrays.

(a) Mismatched bases can be selectively modified by certain chemical reagents. For example, 1-cyclohexyl-3-(2-morpholinoethyl)carbodiimide quantitatively reacts with mismatched guanylate and thymidylate residues, while leaving perfect basepairs intact, including those that are located at the ends of duplexes [Novack, D. F., Casna, N. J., Fischer, S. G. and Ford, J. P. (1986). Detection of Single Base-pair Mismatches in DNA by Chemical Modification Followed by Electrophoresis in a 15% Polyacrylamide Gel, Proc. Natl. Acad. Sci. U.S.A. 83, 586-590]. This modification is very useful because G:T and G:A pairs are the most stable mismatches [Ikuta, S., Takagi, K., Wallace, R. B. and Itakura, K. (1987). Dissociation Kinetics of 19 Base Paired Oligonucleotide-DNA Duplexes Containing Different Single Mismatched Base Pairs, Nucleic Acids Res. 15, 797-811] and are thus more likely to cause an erroneous signal. In addition, both hydroxylamine and osmium tetroxide selectively and quantitatively modify unpaired thymine and cytosine bases [Cotton, R. G. H., Rodrigues, N. R. and Campbell, R. D. (1988). Reactivity of Cytosine and Thymine in Single-base-pair Mismatches with Hydroxylamine and Osmium Tetroxide and Its Application to the Study of Mutations, Proc. Natl. Acad. Sci. U.S.A. 85, 4397-4401]. Because these modifications introduce bulky and/or highly hydrated groups into the mismatched basepair interface, the duplex structure is dramatically distorted, leading to a further decrease in the stability of the mismatched hybrids, while the stability of perfectly matched hybrids remains unchanged [Lebowitz, J., Chaudhuri, A. K., Gonenne, A. and Kitos, G. (1977). Carbodiimide Modification of Superhelical PM2 DNA: Considerations Regarding Reaction at Unpaired Bases and the Unwinding of Superhelical DNA with Chemical Probes, Nucleic Acids Res. 4, 1695-1711]. Thus, mere washing of the array after such a chemical treatment will eliminate almost all of the internally mismatched hybrids. Furthermore, the chemically modified nucleotide residues are recognized by repair enzymes, such as ABC excision nuclease, that specifically cleave DNA strands at the modified sites, resulting in the complete elimination of the corresponding mismatched hybrids [Thomas, D. C., Kunkel, T. A., Casna, N. J., Ford, J. P. and Sancar, A. (1986). Activities and Incision Patterns of ABC Excinuclease on Modified DNA Containing Single-base Mismatches and Extrahelical Bases, J. Biol. Chem. 261, 14496-14505].

(b) If the array is made of oligoribonucleotides, rather than of oligodeoxyribonucleotides, the hybrids that are formed when surveying the oligonucleotides that are present in DNA can be edited by a ribonuclease treatment. Single-base mismatches in RNA:DNA heteroduplexes are recognized by ribonuclease A, which cleaves the RNA strand at or near the site of the mismatched basepair. Cleavage predominantly occurs if the RNA strand contains a mismatched pyrimidine nucleotide. If the RNA strand contains a mismatched purine that is opposite to a pyrimidine nucleotide in the DNA strand, then the presence of the mismatch can be detected by analyzing the complementary DNA strand, where the relative position of the purines and pyrimidines is reversed and the mismatched pyrimidine will occur in the RNA strand, where it can be cleaved [Myers, R. M., Larin, Z. and Maniatis, T. (1985). Detection of Single Base Substitution by Ribonuclease Cleavage at Mismatches in RNA:DNA Duplexes, Science 230, 1242-1246]. The RNA:DNA duplexes can also be edited by the chemical means described in method (a).

With conventional hybridization methods, the ratio of the signal from a perfectly matched hybrid compared to the signal from a false hybrid containing a single internal mismatch is between 10 and 100 [Wilson, K. H., Blitchington, R., Hindenach, B. and Greene, R. (1988). Species-specific Oligonucleotide Probes for rRNA of Clostridium difficile and Related Species, J. Clin. Microbiol. 26, 2484-2488; Zhang, Y., Coyne, M. Y., Will, S. G., Levenson, C. H. and Kawasaki, E. S. (1991). Single-base Mutational Analysis of Cancer and Genetic Diseases Using Membrane Bound Modified Oligonucleotides, Nucleic Acids Res. 19, 3929-3933]. With hybrid proofreading techniques able to eliminate as many as 99% of mismatches, this ratio can be improved to between 1,000 and 10,000, a value that is comparable to the fidelity of most enzymatic reactions.
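The arithmetic behind the improved ratio can be checked with a short sketch. It assumes that proofreading removes a fixed fraction of the mismatched hybrids while leaving the signal from perfectly matched hybrids unchanged, which is a simplification. The function name `improved_ratio` is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def improved_ratio(ratio, fraction_eliminated):
    """Signal ratio after proofreading.

    `ratio` is the matched-to-mismatched signal ratio before
    proofreading; `fraction_eliminated` is the fraction of mismatched
    hybrids removed (assumed not to affect perfect hybrids).
    """
    surviving = 1 - Fraction(fraction_eliminated)
    if surviving <= 0:
        raise ValueError("at least some mismatched hybrids must survive")
    return ratio / surviving


if __name__ == "__main__":
    eliminated = Fraction(99, 100)  # 99% of mismatches removed
    print(improved_ratio(10, eliminated), improved_ratio(100, eliminated))
```

Removing 99% of mismatched hybrids leaves one in a hundred, so each starting ratio is multiplied by 100.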

5.1.2. Detection of Hybrids—

Hybrids can be detected by a number of different means. Unlabeled hybrids can be detected by using surface plasmon resonance techniques, which currently can detect 10′ to 10′ hybrid molecules per square millimeter [Schwarz, T., Yeung, D., McDougall, A., Hawkins, E., Craven, F. C., Buckle, P. E. and Pollard-Knight, D. (1991). Detection of DNA Hybridization by Surface Plasmon Resonance, in Advances in Gene Technology: The Molecular Biology of Human Genetic Disease (Ahmad, F., Bialy, H., Black, S., Howell, R. R., Johnson, D. H., Lubs, H. A., Puett, J. D., Rabin, M. B., Scott, W. A., Van Brunt, J. and Whelan, W. J., eds.), vol. 1, p. 89, The Miami Bio/Technology Winter Symposium]. Alternatively, hybrids can be conventionally labeled, such as with radioactive or fluorescent groups [Landegren, U., Kaiser, R., Caskey, C. T. and Hood, L. (1988). DNA Diagnostics—Molecular Techniques and Automation, Science 242, 229-237]. Fluorescent labels are more convenient to use.

In order to ensure the lowest level of background labeling, it is preferable to label hybrids in a manner such that their detection is dependent on the success of both a ligation and an extension step. This can be accomplished within the scheme of oligonucleotide surveying described in Example 5.1.1, above, by labeling the masking oligonucleotides, and the 2′,3′-dideoxynucleotides used for the extension of the immobilized oligonucleotides, with fluorescent dyes possessing different emission spectra. The fluorescence pattern of the array can then be scanned at different wavelengths, corresponding to the emission maxima of the two dyes, and only signals from those areas in the array that emit fluorescence of both colors are taken as a positive result. For example, dideoxynucleotides can be labeled with fluorescein (whose fluorescence is of green color), without interfering with their ability to serve as good substrates for both reverse transcriptases and DNA polymerases [Prober, J. M., Trainor, G. L., Dam, R. J., Hobbs, F. W., Robertson, C. W., Zagursky, R. J., Cocuzza, A. J., Jensen, M. A. and Baumeister, K. (1987). A System for Rapid DNA Sequencing with Fluorescent Chain-terminating Dideoxynucleotides, Science 238, 336-341]. On the other hand, masking oligonucleotides can be labeled with rhodamine (orange color) or Texas red (red color) [Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Connell, C. R., Heiner, C., Kent, S. B. H. and Hood, L. E. (1986). Fluorescence Detection in Automated DNA Sequence Analysis, Nature 321, 674-679].
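The two-color acceptance rule can be sketched as a simple scan over the array. The sketch assumes that each scan is a rectangular grid of intensities (one value per array area) and that a single cutoff, here called `threshold`, separates signal from background in both channels. The function name `positive_areas`, the grid layout, and the shared cutoff are illustrative assumptions, not part of the described procedure.

```python
def positive_areas(masking_signal, ddntp_signal, threshold):
    """Return (row, column) positions that fluoresce in both channels.

    `masking_signal` is the scan at the emission maximum of the dye on
    the masking oligonucleotide (reports ligation); `ddntp_signal` is
    the scan at the emission maximum of the dye on the
    dideoxynucleotide (reports extension).
    """
    positives = []
    for row, (m_row, d_row) in enumerate(zip(masking_signal, ddntp_signal)):
        for col, (m, d) in enumerate(zip(m_row, d_row)):
            # An area counts only if both ligation and extension succeeded.
            if m >= threshold and d >= threshold:
                positives.append((row, col))
    return positives


if __name__ == "__main__":
    red = [[0.9, 0.1], [0.8, 0.7]]    # masking-oligonucleotide channel
    green = [[0.8, 0.9], [0.2, 0.6]]  # dideoxynucleotide channel
    print(positive_areas(red, green, 0.5))
```

Areas that light up in only one channel, such as a ligated but unextended hybrid, are discarded, which is what keeps the background low.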

After hybrids are extended (concomitant with labeling) and edited, the array is thoroughly washed to remove all unincorporated label, to destroy unextended hybrids, and to discriminate one more time against mismatched hybrids that might have remained in the array. A preferred method is to wash the array at steadily increasing temperature, with the signal from each individual area being read at a predetermined time, when the conditions ensure the highest selectivity for the particular hybrid that forms in that area [Khrapko, K. R., Lysov, Yu. P., Khorlin, A. A., Shik, V. V., Florentiev, V. L. and Mirzabekov, A. D. (1989). An Oligonucleotide Hybridization Approach to DNA Sequencing, FEBS Lett. 256, 118-122]. Other conditions (such as denaturant and/or salt concentration) can also be controlled over time. The fluorescence pattern can be recorded at predetermined time intervals with a scanning microfluorometer, such as an epifluorescence microscope [Fodor et al., 1991].

5.1.3. Surveys of Selected Oligonucleotides in DNA Strands—

Selected oligonucleotides present in a DNA strand, or a group of strands, can be surveyed on a binary array, whose immobilized oligonucleotides' variable segments comprise a collection of sequences that are complementary to the sequences of interest that may occur in the DNA sample being analyzed. These selected oligonucleotides may be, for example, a catalog of short oligonucleotide segments of a genome that are of special interest. For example, they may be segments whose alteration frequently results in (or accompanies) a disease. They may also be particularly variable segments whose identification, for example, can help to establish who the actual parents of a particular person are (i.e., they are rapidly evolving segments). In these cases, the variable regions of the immobilized oligonucleotide can be chosen so that they are long enough to be unique, or relatively unique, in the genome. The analyzed sample can, for example, be a group of genome fragments (see Examples 1.1 to 1.4 and Example 1.6), or a mixture of strands obtained, for example, through the use of whole-genome PCR utilizing a set of selected primers that are targeted to particular genome regions [Kinzler, K. W. and Vogelstein, B. Whole Genome PCR: Application to the Identification of Sequences Bound by Gene Regulatory Proteins, Nucleic Acids Res. 17, 3645-3653 (1989)].

5.1.4. Surveys of Signature Oligonucleotides—

Binary arrays are useful for surveying signature oligonucleotides present in the sorted DNA fragments. The identification of signature oligonucleotides helps to establish the order of restriction fragments of digested chromosomes (see section V, above). A signature oligonucleotide consists of a variable oligonucleotide segment of a pre-selected length and an adjacent recognition site for a chosen restriction endonuclease. Accordingly, the constant segment of the immobilized oligonucleotide in the binary array includes, in this case, the sequence that is complementary to this restriction site. In contradistinction to comprehensive surveying, which is described in Example 5.1.1, above, a masking oligonucleotide should not protect this portion of the constant segment, so that this portion is able to hybridize to a signature oligonucleotide in the fragment. The procedure itself is that described in Examples 5.1.1 and 5.1.2, above. However, because the surveyed oligonucleotides are longer in this case, the DNA to be analyzed should be degraded into longer pieces.

5.2. Surveying RNA Strands—

As is the case for DNA strands, comprehensive surveys can be carried out to determine all the oligonucleotides that occur in RNA strands (e.g., for sequencing), or only selected oligonucleotides can be surveyed (e.g., to identify RNAs of a known sequence in a clinical sample). In contradistinction to DNA strands, RNA strands (or partials) can be degraded randomly under non-denaturing conditions (e.g., by treatment with a mixture of nuclease S1 and ribonuclease V1, as described in Example 3.4.1, above). The resulting RNA pieces can be ligated to masking oligonucleotides after hybridization to the array by utilizing DNA ligase (see Example 5.1.1, above). Alternatively, RNA 3′ termini can be ligated in solution (after nuclease inactivation by heating) to an oligoribonucleotide or an oligodeoxyribonucleotide by utilizing RNA ligase (as described in Example 3.3.1, above). The extended RNA pieces are then hybridized to a binary array whose oligonucleotides' constant segments are complementary to the ligated oligonucleotide. After ligation, the procedure is as described for surveying DNA strands (Example 5.1.1, above). Double labeling of hybrids at their termini (as described in Example 5.1.2, above) is preferable, in order to enhance the specificity of hybrid detection. RNA hybrids can also be proofread by the methods described for DNA strand surveying (see Example 5.1.1, above). In that case, ribonuclease editing is more effective if the array contains immobilized oligoribonucleotides, because both strands of a hybrid can be cleaved when the hybrid contains mismatched pyrimidines.

6. Examples of Interpretation of Oligonucleotide Information Obtained from Surveys of Partial Strands, for Determining the Nucleotide Sequences of a Mixture of Nucleic Acid Strands

6.1 Determination of the Nucleotide Sequences of Strands in a Mixture when Each Strand Possesses at Least One Oligonucleotide that does not Occur in any Other Strand in the Mixture—

FIGS. 18 to 21 depict the determination of the sequences of two mixed strands using the methods of the invention. The example demonstrates the power of the invention to identify all of the oligonucleotides that are present in a strand (i.e., its strand set) when that strand possesses at least one oligonucleotide that does not occur in any other strand in the mixture. In particular, the example demonstrates: (a) how the data obtained by surveying the partial strands generated from a mixture of strands and sorted by their variable termini (i.e., the upstream subset of each address) and the inferred downstream subset of each address (which together form the indexed address sets) are used to construct the unindexed address sets; and (b) how the unindexed address sets are compared to each other to identify prime sets, i.e., address sets that contain only one strand set. The example also demonstrates how the oligonucleotides that are contained in a strand set are assembled into the sequence of the strand, even though the primary data is obtained from a mixture of strands. In particular, the example demonstrates: (a) how the oligonucleotides in a strand set are assembled into sequence blocks; (b) how the contents of the indexed address sets are filtered so that only information pertaining to the oligonucleotides in a particular strand set remains; (c) how this filtered oligonucleotide data is re-expressed in terms of the sequence blocks that are contained in that particular strand; (d) how the information contained in the resulting “block sets” is used to identify those blocks that definitely occur only once in the strand (“unique blocks”) and to identify those blocks that can potentially occur more than once in the strand; (e) how the information contained in the block sets of unique blocks is used to determine the relative order of the blocks that occur only once in the strand; (f) how the information contained in the block sets limits the positions at which the other blocks can occur (relative to other blocks); and (g) how a consideration of the sequences at the ends of blocks, in combination with a consideration of the relative positions of the blocks, leads to the unambiguous determination of the complete sequence of the strand. This example also illustrates: (a) how oligonucleotides that occur more than once in a strand are identified and located within the sequence, even though the survey data contain no information as to the number of times a particular oligonucleotide occurs in a partial or a mixture of partials having the same terminal oligonucleotide; and (b) how the sequences of different strands in a mixture can be determined separately, despite the fact that many of the oligonucleotides occur in more than one strand in the mixture.

FIG. 18 a shows the sequences of two short strands (parental strands) that are assumed to be present in a mixture (with no other strands). It is also assumed that complete sets of partials have been generated from this mixture of strands, and that each set of partials has been separately surveyed, with the partials sharing the same address oligonucleotide being surveyed together. For the purpose of illustrating the method of analyzing the data, it is assumed that the address oligonucleotides and the surveyed oligonucleotides are three nucleotides in length. In practice, longer oligonucleotides should be used. However, for the purpose of illustration it is easier to comprehend an example based on trinucleotides. The same methods of analyzing the data apply when longer oligonucleotides are surveyed, when much longer strands are in the mixture, and when the mixture contains many more strands.

FIG. 18 b shows the upstream subsets determined by surveying each relevant address in the partialing array (shown on the left), and the downstream subsets inferred by the method described above in section V (shown on the right), (i.e., FIG. 18 b shows indexed address sets). The address oligonucleotides (shown in bold letters) are listed vertically in the center of the diagram. The oligonucleotides listed horizontally to the left of each address oligonucleotide are those oligonucleotides that were detected in a survey of the partials at that address (the upstream subset). The oligonucleotides listed horizontally to the right of each address oligonucleotide are those oligonucleotides that are inferred from the upstream subsets to occur downstream of that address oligonucleotide (the downstream subset). For example, oligonucleotide “ACC” is contained in the upstream subset of the address oligonucleotide “CCT”. This means that oligonucleotide “CCT” occurs downstream of oligonucleotide “ACC” in at least one of the strands in the mixture. Therefore “CCT” is inferred to be in the downstream subset of address set “ACC”. The remaining downstream oligonucleotides in all of the address sets are similarly inferred. Note that an address oligonucleotide is always a member of its own upstream and downstream subsets.
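The inference of downstream subsets from upstream subsets can be sketched in Python as follows. This is a minimal illustration, not the procedure of the figures: the two strand strings are an assumption (they were reconstructed from the sequence blocks and joins described later in this example, not copied from FIG. 18 a), and `simulate_upstream` merely stands in for the survey of the partialing array by assuming that complete sets of partials were surveyed without error.

```python
def kmers(strand, n=3):
    """All oligonucleotides of length n in a strand, in order of occurrence."""
    return [strand[i:i + n] for i in range(len(strand) - n + 1)]

def simulate_upstream(strands, n=3):
    """Stand-in for the survey: the upstream subset of each address."""
    upstream = {}
    for strand in strands:
        oligos = kmers(strand, n)
        for i, address in enumerate(oligos):
            # A partial ending at this address contains everything up to it.
            upstream.setdefault(address, set()).update(oligos[:i + 1])
    return upstream

def infer_downstream(upstream):
    """If X is upstream of address Y, then Y is downstream of address X."""
    downstream = {address: set() for address in upstream}
    for address, detected in upstream.items():
        for oligo in detected:
            downstream.setdefault(oligo, set()).add(address)
    return downstream

# Assumed parental strands (reconstructed from the text, see lead-in).
STRANDS = ["CATGGTACCTTGGTAA", "ATGGTCCTTGGTACCTA"]
upstream = simulate_upstream(STRANDS)
downstream = infer_downstream(upstream)
print("ACC" in upstream["CCT"], "CCT" in downstream["ACC"])  # True True
```

Because every address is detected in its own survey, the inversion automatically places each address oligonucleotide in its own downstream subset as well.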

After the indexed address sets of all the addresses in the parental strands have been determined (as shown in FIG. 18 b), the information is organized into unindexed address sets (FIG. 18 c), having no division into downstream and upstream subsets, but merely listing, for each address oligonucleotide, those oligonucleotides that occur in either the upstream or downstream subset (or that occur in both subsets). In FIG. 18 c, the address oligonucleotides (shown in bold letters) are listed vertically on the left side of the diagram. Note that the address oligonucleotide is always a member of its own unindexed address set.

Unindexed address sets are then grouped together according to the identity of the oligonucleotides that they contain (FIG. 18 d). Unindexed address sets that contain an identical set of oligonucleotides are grouped together. It can be seen that three groups of address sets are formed in this example. The groups are identified by the Roman numerals (I, II, and III). The address oligonucleotides of each group (for example, CTA, GTC, and TCC in group II) always occur together in a strand. The group of address oligonucleotides can occur together in more than one strand.
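A minimal Python sketch of forming unindexed address sets and grouping identical ones is given below. The strand strings are an assumption (reconstructed from the block joins described later in this example, not taken from FIG. 18 a), and the downstream subsets are simulated directly rather than inferred, which gives the same result when complete sets of partials are surveyed.

```python
from collections import defaultdict

def indexed_address_sets(strands, n=3):
    """Simulated upstream and downstream subsets for every address."""
    up, down = defaultdict(set), defaultdict(set)
    for s in strands:
        oligos = [s[i:i + n] for i in range(len(s) - n + 1)]
        for i, address in enumerate(oligos):
            up[address].update(oligos[:i + 1])
            down[address].update(oligos[i:])
    return up, down

def unindexed_address_sets(up, down):
    """Drop the upstream/downstream division: take the union of both subsets."""
    return {a: frozenset(up[a] | down[a]) for a in up}

def group_identical(unindexed):
    """Group address oligonucleotides whose unindexed address sets are identical."""
    groups = defaultdict(list)
    for address, members in unindexed.items():
        groups[members].append(address)
    return {members: sorted(addresses) for members, addresses in groups.items()}

STRANDS = ["CATGGTACCTTGGTAA", "ATGGTCCTTGGTACCTA"]  # assumed strands
up, down = indexed_address_sets(STRANDS)
groups = group_identical(unindexed_address_sets(up, down))
for addresses in sorted(groups.values(), key=len):
    print(addresses)
```

With these assumed strands three groups are formed, one of which contains the addresses CTA, GTC, and TCC, in agreement with group II as described above.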

Each group of identical address sets is then compared to all other groups of identical address sets to see if its common address set appears to be a prime address set. This is accomplished for each address set by seeing whether any other address set is a subset of it. For example, in FIG. 18 d, the address set common to group III is not a prime address set, because the address set common to group I is a subset of the address set common to group III. However, the address set common to group I does appear to be a prime address set, because neither the address set common to group II, nor the address set common to group III, is a subset of the address set common to group I. Similarly, the address set common to group II appears to be a prime address set.

Each putative prime address set is then tested to see if it is a strand set. This is accomplished by examining all the address sets that contain all of the oligonucleotides that are present in the putative prime address set. For example, in FIG. 19 a, all the address sets that contain all the oligonucleotides that are present in the putative prime address set common to group I are listed together (namely the address sets contained in groups I and III). The address oligonucleotides are shown in bold letters on the left side of the diagram, and the groups are identified by Roman numerals. The address set common to group I is indeed a prime address set (and therefore it contains a single strand set) because a list of the eleven oligonucleotides that are found in every address set in the diagram (they are seen as full columns) is identical to the list of eleven addresses on the left side of the diagram. Similarly, FIG. 19 b shows why the address set common to group II is also a prime set. In particular, the twelve oligonucleotides common to every address set in the diagram are all found in the list of twelve addresses on the left side of the diagram. Had either of these putative prime address sets not turned out to indeed be a prime set (by the criterion described above), then it would have been identified as a pseudo-prime address set, and further analysis would have been required to decompose it into its constituent strand sets (as will be shown in Example 6.2, below).
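The subset comparison and the prime-set test can be sketched in Python as shown below. The strand strings are assumed (reconstructed from the block joins described later in this example), and the helper `unindexed_address_sets` takes a shortcut that is valid only for error-free, complete surveys: the unindexed address set of an oligonucleotide is then the union of the oligonucleotide sets of every strand that contains it.

```python
def unindexed_address_sets(strands, n=3):
    """Shortcut simulation of the unindexed address sets (see lead-in)."""
    sets = {}
    for s in strands:
        oligos = {s[i:i + n] for i in range(len(s) - n + 1)}
        for address in oligos:
            sets[address] = sets.get(address, frozenset()) | oligos
    return sets

def putative_primes(unindexed):
    """Address sets that have no other address set as a proper subset."""
    distinct = set(unindexed.values())
    return [s for s in distinct if not any(other < s for other in distinct)]

def is_prime(candidate, unindexed):
    """Prime if the oligonucleotides common to every address set containing
    the candidate are identical to the list of those addresses."""
    addresses = {a for a, members in unindexed.items() if candidate <= members}
    common = frozenset.intersection(*(unindexed[a] for a in addresses))
    return common == addresses

STRANDS = ["CATGGTACCTTGGTAA", "ATGGTCCTTGGTACCTA"]  # assumed strands
unindexed = unindexed_address_sets(STRANDS)
for candidate in sorted(putative_primes(unindexed), key=len):
    print(len(candidate), is_prime(candidate, unindexed))
# prints "11 True" and then "12 True"
```

The two candidates found here contain eleven and twelve oligonucleotides, matching the counts given for FIG. 19 a and FIG. 19 b.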

Once the strand sets in a mixture have been identified, the oligonucleotides in each strand set can be assembled into the nucleotide sequence of the strand. This is accomplished in a series of steps, as illustrated in FIG. 20 (which utilizes the strand set determined in FIG. 19 a).

First, the oligonucleotides in the strand set are assembled into sequence blocks. A sequence block contains one or more uniquely overlapping oligonucleotides. Two oligonucleotides of length n uniquely overlap each other if they share an identical sub-sequence that is n−1 nucleotides long and no other oligonucleotides in the same strand set share that sub-sequence. For example, for the strand set shown in FIG. 20 a, the oligonucleotides “CAT” and “ATG” share the sub-sequence “AT”, which does not occur in other oligonucleotides. These two oligonucleotides therefore uniquely overlap to form the sequence block “CATG”, as shown in FIG. 20 b. Similarly, oligonucleotide “TGG” uniquely overlaps oligonucleotide “GGT” by the common sub-sequence “GG”, and oligonucleotide “GGT” also uniquely overlaps (on its other end) oligonucleotide “GTA” by the common sub-sequence “GT”. Thus, the three oligonucleotides (“TGG”, “GGT”, and “GTA”) can be maximally overlapped to form sequence block “TGGTA”. In forming sequence blocks, the following rule is adhered to: two oligonucleotides can be included in the same block if they are the only oligonucleotides in the strand set to possess their common sub-sequence. Thus, “ATG” does not uniquely overlap “TGG”, because the strand set contains a third oligonucleotide, “TTG”, that shares the common sub-sequence “TG”. If, following these rules, an oligonucleotide does not uniquely overlap any other oligonucleotide, then a sequence block consists of only that oligonucleotide. For example, “TAA” forms its own block. Following the above rules, the eleven oligonucleotides that occur in strand set A can be assembled into four sequence blocks.
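The unique-overlap rule can be sketched in Python as follows. This is one possible reading of the rule, written for illustration only: the function name is hypothetical, and chains of oligonucleotides that form a closed cycle (which do not arise in this example) are not handled.

```python
def sequence_blocks(strand_set):
    """Assemble a strand set into sequence blocks of uniquely overlapping oligos."""
    n1 = len(next(iter(strand_set))) - 1          # overlap length, n - 1
    holders = {}                                  # sub-sequence -> oligos sharing it
    for oligo in strand_set:
        for sub in {oligo[:n1], oligo[-n1:]}:
            holders.setdefault(sub, set()).add(oligo)

    successor = {}
    for sub, group in holders.items():
        enders = [o for o in group if o[-n1:] == sub]
        starters = [o for o in group if o[:n1] == sub]
        # Unique overlap: exactly two oligos share the sub-sequence,
        # one ending with it and a different one starting with it.
        if (len(group) == 2 and len(enders) == 1 and len(starters) == 1
                and enders != starters):
            successor[enders[0]] = starters[0]

    blocks = []
    for start in sorted(strand_set - set(successor.values())):
        block, current = start, start
        while current in successor:
            current = successor[current]
            block += current[-1]
        blocks.append(block)
    return blocks

strand_set_a = {"CAT", "ATG", "TGG", "GGT", "GTA", "TAC",
                "ACC", "CCT", "CTT", "TTG", "TAA"}
print(sequence_blocks(strand_set_a))
# prints ['CATG', 'TAA', 'TACCTTG', 'TGGTA']
```

The eleven oligonucleotides listed are those named in the text for strand set A, and the sketch returns the four blocks described above.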

Second, the data contained in the indexed address sets shown in FIG. 18 b are filtered to remove extraneous information that does not pertain to strand set A. FIG. 20 c shows the resulting filtered address sets. All address sets whose address oligonucleotide is not one of the oligonucleotides in strand set A are eliminated. In addition, all oligonucleotides that are not members of strand set A are removed from the upstream and downstream subsets of the remaining address sets. The resulting filtered address sets are then grouped together according to the oligonucleotides that are contained in each block. For example, the filtered address sets for address oligonucleotides “CAT” and “ATG” have been grouped together in FIG. 20 c because these two oligonucleotides are contained in sequence block “CATG”. In FIG. 20 c, the address oligonucleotides found in the same block are identified by rectangular boxes. In addition, oligonucleotides that occur in the same block are grouped together within each upstream and downstream subset.

Third, the filtered address sets are converted into block sets, as shown in FIG. 20 d. In a block set, the information from different address sets is combined. Instead of a different horizontal line for each filtered address set that pertains to a particular block, the information in all of the address sets that pertain to that particular block is combined into a single horizontal line. For example, in FIG. 20 c, five different filtered address sets pertain to sequence block “TACCTTG”. In FIG. 20 d, these five lines are combined into a single line in which the address oligonucleotides are replaced by an “address block”, shown as “TACCTTG” surrounded by a bold box. Similarly, the upstream oligonucleotides are replaced by upstream blocks, and the downstream oligonucleotides are replaced by downstream blocks. In substituting sequence blocks for the upstream (or downstream) oligonucleotides that are contained in the filtered address sets that pertain to a given address block, the following rule is adhered to: a sequence block only occurs in the upstream subset (or in the downstream subset) of an address block if every oligonucleotide that is contained in that sequence block occurs in the upstream (or in the downstream) subset of every filtered address set that pertains to that address block. For example, sequence block “CATG” occurs in the upstream subset of address block “TACCTTG” because oligonucleotides “CAT” and “ATG” occur in the upstream subset of address oligonucleotides “TAC”, “ACC”, “CCT”, “CTT”, and “TTG”.
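The filtering step and the conversion to block sets can be sketched in Python as follows. The strand strings are an assumption (reconstructed from the block joins described in this example, not copied from FIG. 18 a), the indexed address sets are simulated from them, and all function names are hypothetical.

```python
def kmers(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def indexed_sets(strands, n=3):
    """Simulated upstream and downstream subsets of every address."""
    up, down = {}, {}
    for s in strands:
        oligos = kmers(s, n)
        for i, a in enumerate(oligos):
            up.setdefault(a, set()).update(oligos[:i + 1])
            down.setdefault(a, set()).update(oligos[i:])
    return up, down

def filter_sets(subsets, strand_set):
    """Keep only addresses and members that belong to the strand set."""
    return {a: members & strand_set
            for a, members in subsets.items() if a in strand_set}

def block_sets(blocks, up, down, n=3):
    """A block is upstream (downstream) of an address block only if all of its
    oligos occur in that subset of every address in the address block."""
    def side(address_block, subsets):
        addresses = kmers(address_block, n)
        return {b for b in blocks
                if all(o in subsets[a] for a in addresses for o in kmers(b, n))}
    return {b: (side(b, up), side(b, down)) for b in blocks}

STRANDS = ["CATGGTACCTTGGTAA", "ATGGTCCTTGGTACCTA"]  # assumed strands
strand_set_a = set(kmers(STRANDS[0]))
blocks_a = ["CATG", "TGGTA", "TACCTTG", "TAA"]
up, down = indexed_sets(STRANDS)
sets_a = block_sets(blocks_a,
                    filter_sets(up, strand_set_a),
                    filter_sets(down, strand_set_a))
print(sets_a["TACCTTG"])
```

For address block “TACCTTG” the upstream subset contains “CATG” and “TGGTA”, and the downstream subset contains “TGGTA” and “TAA”, consistent with the relationships described in the text.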

Often, a sequence block does not occur in its own upstream or downstream subset. For example, sequence block “CATG” does not occur in the upstream or downstream subset of its own block set (i.e., in block set “CATG”), because oligonucleotide “ATG” is not present in the upstream subset of address set “CAT” and oligonucleotide “CAT” is not present in the downstream subset of address set “ATG”. When a sequence block does not occur in its own upstream or downstream subset, this indicates that that sequence block occurs only once in the nucleotide sequence of that strand. However, a sequence block may occur in both the upstream subset and in the downstream subset of its own block set. For example, sequence block “TGGTA” occurs in both the upstream subset and in the downstream subset of block set “TGGTA”. When a sequence block does occur in its own upstream and downstream subsets, it indicates that the sequence block may occur more than once in the sequence. However, it does not indicate that the sequence block definitely occurs more than once in the sequence. The presence of more than one parental strand in the original mixture can introduce additional oligonucleotides into the filtered upstream and downstream subsets that can cause a block that actually occurs only once in a sequence to appear in both the upstream and downstream subsets of its own block set. However, further analysis of the data determines the multiplicity of each block in the strand (as described below), thus resolving these uncertainties. For convenience, block sets that pertain to blocks that definitely occur only once in the sequence are listed together. For example, in FIG. 20 d, block set “CATG” and block set “TACCTTG” are listed together in the upper section of the block set diagram.
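The distinction between unique blocks and blocks that may repeat can be expressed as a short Python sketch. The block sets below are an assumption: they were derived from strands reconstructed from this example, not copied from FIG. 20 d, although they agree with every relationship stated in the text. The sketch treats a block as unique when it is missing from at least one of its own subsets, since a block that occurred twice would necessarily appear in both.

```python
# block -> (upstream blocks, downstream blocks); assumed values, see lead-in.
BLOCK_SETS_A = {
    "CATG": (set(), {"TGGTA", "TACCTTG", "TAA"}),
    "TACCTTG": ({"CATG", "TGGTA"}, {"TGGTA", "TAA"}),
    "TGGTA": ({"CATG", "TGGTA", "TACCTTG"}, {"TGGTA", "TACCTTG", "TAA"}),
    "TAA": ({"CATG", "TGGTA", "TACCTTG", "TAA"}, {"TAA"}),
}

def classify_blocks(block_sets):
    """Split blocks into those that definitely occur once and those that may repeat."""
    unique, may_repeat = [], []
    for block, (upstream, downstream) in sorted(block_sets.items()):
        if block in upstream and block in downstream:
            may_repeat.append(block)
        else:
            unique.append(block)
    return unique, may_repeat

unique, may_repeat = classify_blocks(BLOCK_SETS_A)
print(unique)      # prints ['CATG', 'TACCTTG']
print(may_repeat)  # prints ['TAA', 'TGGTA']
```

Note that “TAA” lands in the second list even though it turns out to occur only once; as the text explains, membership in both subsets indicates only that repetition is possible.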

Fourth, the position of each sequence block relative to the other sequence blocks is determined. An examination of the block sets that pertain to unique blocks (blocks that definitely occur only once in the nucleotide sequence of the strand) indicates their relative positions. For example, in FIG. 20 d, block set “CATG” indicates that unique sequence block “TACCTTG” occurs downstream of unique sequence block “CATG”. This is confirmed by block set “TACCTTG”, in which unique sequence block “CATG” occurs upstream of unique sequence block “TACCTTG”. The relative position of the two unique sequence blocks is indicated in FIG. 20 e, where the top line to the left of the arrow shows “CATG” upstream (to the left) of “TACCTTG”. The relative position of the sequence blocks that can potentially occur more than once in the nucleotide sequence of the strand is determined from their presence or absence in the upstream and downstream subsets of other sequence blocks. For example, sequence block “TAA” occurs in the downstream subset of block set “CATG” (and does not occur in the upstream subset of block set “CATG”). Furthermore, sequence block “TAA” also occurs in the downstream subset of block set “TACCTTG” (and does not occur in the upstream subset of block set “TACCTTG”). Therefore, sequence block “TAA” must occur downstream of both unique sequence block “CATG” and unique sequence block “TACCTTG”. This is indicated in FIG. 20 e, where the bottom line to the left of the arrow shows “TAA” as occurring downstream of “CATG” and “TACCTTG”. Furthermore, sequence block “TGGTA” occurs only in the downstream subset of block set “CATG”. Therefore, it must occur downstream of “CATG” in the nucleotide sequence. On the other hand, sequence block “TGGTA” occurs in both the upstream and downstream subsets of block set “TACCTTG”. This indicates that “TGGTA” can potentially occur in the sequence at positions both upstream and downstream of unique sequence block “TACCTTG”. Finally, “TGGTA” only occurs upstream of “TAA”. This is indicated in FIG. 20 e, where the bottom line to the left of the arrow contains a bracket that shows the range of positions at which “TGGTA” can occur, relative to the positions of the other sequence blocks. At this point in the analysis, the diagram to the left of the arrow in FIG. 20 e contains all the information obtained that pertains to strand set A.

Finally, the sequence of the strand is ascertained by taking into account both the relative position of the sequence blocks, as shown in the diagram to the left of the arrow in FIG. 20 e, and the identity of the sequences at the ends of the sequence blocks. The object of this last step in sequence determination is to assemble the blocks into the final sequence. Four rules are followed: (a) each of the blocks must be used at least once; (b) the blocks must be assembled into a single sequence; (c) the ends of blocks that are to be joined must maximally overlap each other (i.e., if the surveyed oligonucleotides are n nucleotides in length, then two blocks maximally overlap each other if they share a terminal sub-sequence that is n−1 nucleotides in length); and (d) the order of the blocks must be consistent with their positions relative to one another, as ascertained from the block sets. For example, in FIG. 20 e, “CATG” is upstream of “TACCTTG”. “CATG” cannot be joined directly to “TACCTTG”, since these two sequence blocks do not possess maximally overlapping terminal sequences (two nucleotides in length). However, an examination of the permissible positions at which other sequence blocks can occur indicates that “TGGTA” can occur in the gap between “CATG” and “TACCTTG”. The ends of these sequence blocks are then examined to see whether the gap can be bridged. “CATG” can be joined to “TGGTA” by maximally overlapping their shared terminal sub-sequence “TG”. Furthermore, “TGGTA” can be joined to “TACCTTG” by maximally overlapping their shared terminal sub-sequence “TA”. Similarly, the gap that occurs downstream of “TACCTTG” can potentially be filled by both “TAA” and “TGGTA”. “TAA” must be used, because it was not used at any other location. However, “TACCTTG” cannot be directly joined to “TAA”. The solution is to join “TACCTTG” to “TGGTA”, and then to join “TGGTA” to “TAA”. Thus, the sequence of strand A (which is shown in FIG. 20 f) is unambiguously assembled by utilizing sequence block “TGGTA” twice (as summarized in the diagram to the right of the arrow in FIG. 20 e).
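The four assembly rules can be explored with a small search in Python. This sketch is a simplification rather than the procedure of FIG. 20 e: instead of consulting block sets, it enforces rule (d) by checking every candidate directly against the surveyed upstream subsets, which carry the same ordering information. The strand strings used to simulate the survey are an assumption (reconstructed from the joins described above), and `max_len` is a hypothetical bound that keeps the search finite.

```python
def kmers(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def simulate_upstream(strands, n=3):
    upstream = {}
    for s in strands:
        oligos = kmers(s, n)
        for i, a in enumerate(oligos):
            upstream.setdefault(a, set()).update(oligos[:i + 1])
    return upstream

def assemble(blocks, upstream, n=3, max_len=40):
    """Enumerate single sequences that use every block at least once."""
    results = []

    def consistent(seq):
        # Rule (d): whatever precedes an oligo must have been seen upstream of it.
        oligos = kmers(seq, n)
        return all(set(oligos[:i + 1]) <= upstream[o]
                   for i, o in enumerate(oligos))

    def extend(seq, used):
        if used == set(blocks):                      # rules (a) and (b)
            results.append(seq)
        for b in blocks:
            if seq[-(n - 1):] == b[:n - 1]:          # rule (c): maximal overlap
                longer = seq + b[n - 1:]
                if len(longer) <= max_len and consistent(longer):
                    extend(longer, used | {b})

    for b in blocks:
        extend(b, {b})
    return results

STRANDS = ["CATGGTACCTTGGTAA", "ATGGTCCTTGGTACCTA"]  # assumed strands
upstream = simulate_upstream(STRANDS)
print(assemble(["CATG", "TGGTA", "TACCTTG", "TAA"], upstream))
# prints ['CATGGTACCTTGGTAA']
```

Only one sequence survives, and it uses “TGGTA” twice, as described for strand A.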

The same procedure is followed to determine the nucleotide sequence of strand B (see FIG. 21). In this example, there are three sequence blocks that do not occur in their own upstream or downstream subsets, and they therefore definitely occur only once in the sequence of strand B (namely, sequence blocks “CTTG”, “GTCC”, and “TACC”). An examination of block set “GTCC” shows that “GTCC” occurs upstream of “CTTG” and “TACC”. However, an examination of block set “CTTG” and an examination of block set “TACC” indicate that sequence blocks “CTTG” and “TACC” can both occur upstream and downstream of each other, which appears to conflict with the observation that these sequence blocks only occur once in the sequence of strand B. There is actually no conflict. Each of these sequence blocks does indeed occur only once in the sequence. It is just that their positions, relative to one another, in strand B are obscured by the presence of conflicting information from the relative positions of oligonucleotides that occur in strand A. This ambiguity (indicated by the identical positions of sequence blocks “CTTG” and “TACC” in the diagram to the left of the arrow in FIG. 21 e) is resolved as the remainder of the information is taken into account. The positions of those sequence blocks that can potentially occur more than once in the sequence of strand B are determined from other block sets. First, the block sets of the sequence blocks that definitely occur only once in the sequence (namely, block sets “CTTG”, “GTCC”, and “TACC”) are consulted. The range of positions at which these other sequence blocks can occur (relative to the positions of other blocks) is indicated in the diagram to the left side of the arrow in FIG. 21 e.

The assembly of the nucleotide sequence of Strand B proceeds as follows: “ATG” is upstream of all other blocks. The uniquely occurring block immediately downstream of “ATG” is “GTCC”. “ATG” and “GTCC” cannot be directly joined. However, “ATG” can be directly joined to “TGGT”, so the correct order is to join “ATG” to “TGGT”, and then to join “TGGT” to “GTCC”. Neither “CTTG” nor “TACC” can be directly joined to “GTCC”. Three different sequence blocks can be used to bridge this gap (namely, “CCT”, “GTA”, and “TGGT”). The only combination of these three sequence blocks that can fill this gap is “CCT” alone, which bridges the gap between “GTCC” and “CTTG”. This resolves the ambiguity as to the relative positions of “CTTG” and “TACC”. “CTTG” is therefore upstream of “TACC”. “CTTG” cannot be directly joined to “TACC”. Again, there are three different sequence blocks that can be used to fill this gap (namely, “CCT”, “GTA”, and “TGGT”). The only combination of these three sequence blocks that can fill this gap is “TGGT” and “GTA” (i.e., “CTTG” is joined to “TGGT”, “TGGT” is joined to “GTA”, and “GTA” is joined to “TACC”). And finally, “CTA”, which occurs downstream of all other blocks, must be included in the sequence. However, “TACC” cannot be directly joined to “CTA”. There are three different sequence blocks that can be used to fill this gap (namely, “CCT”, “GTA”, and “TGGT”). The only combination of these three sequence blocks that can fill this gap is “CCT” alone. Thus, the assembly of the nucleotide sequence of Strand B from its sequence blocks is completed. Note that some of the sequence blocks that could potentially occur in the sequence more than once, actually occur only once (e.g., “GTA”). Other sequence blocks that could potentially occur in the sequence more than once, actually occur more than once (e.g., “CCT”).
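The chain of joins described for Strand B can be checked with a few lines of Python. The function name is hypothetical; the block order is the one that follows from the joins described in the paragraph above, and the resulting string is a reconstruction, not a copy of the sequence shown in FIG. 21.

```python
def join_blocks(blocks, n=3):
    """Join blocks in order, requiring an overlap of n - 1 nucleotides at each joint."""
    seq = blocks[0]
    for b in blocks[1:]:
        if seq[-(n - 1):] != b[:n - 1]:
            raise ValueError(f"{seq!r} and {b!r} do not maximally overlap")
        seq += b[n - 1:]
    return seq

order_b = ["ATG", "TGGT", "GTCC", "CCT", "CTTG",
           "TGGT", "GTA", "TACC", "CCT", "CTA"]
strand_b = join_blocks(order_b)
print(strand_b)  # prints ATGGTCCTTGGTACCTA
```

Every joint in this order shares a two-nucleotide overlap, whereas a direct join such as “ATG” to “GTCC” raises an error, which mirrors the reasoning in the text.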

Thus, using the methods of this invention, the entire sequence of strand B is unambiguously determined, despite the fact that some oligonucleotides occur more than once in its sequence, despite the fact that more than one sequence block can be assembled from the oligonucleotides that occur in the strand, despite the fact that the multiplicity of occurrence of each oligonucleotide is not determined during surveying, despite the fact that the strand is analyzed in a mixture of strands, and despite the fact that the other strand in the mixture possesses many of the same oligonucleotides.

6.2 Determination of the Nucleotide Sequences of Strands in a Mixture when Some of the Strands do not Possess at Least One Oligonucleotide that does not Occur in any Other Strand in the Mixture—

FIGS. 22 to 28 depict the determination of the sequences of four strands in a mixture with each other using the methods of the invention. The example demonstrates the power of the invention to identify all of the oligonucleotides that are present in a strand (i.e., its strand set) when some of the strands (in this example, three of the four strands) do not possess even one oligonucleotide that does not occur in any other strand in the mixture.

FIG. 22 a shows the sequences of four short strands that are assumed to be present in a mixture. As in Example 6.1, above, it is assumed that complete sets of partials have been generated from this mixture of strands, and that each set of partials sharing the same address oligonucleotide has been separately surveyed. The address oligonucleotides and the surveyed oligonucleotides are assumed to be three nucleotides in length. FIG. 22 b shows the indexed address sets determined for each relevant address in the partialing array. FIG. 22 c shows the unindexed address sets, and FIG. 22 d shows the unindexed address sets organized into groups according to the identity of the oligonucleotides that they contain. In this example, there are seven different groups of unindexed address sets.

As in Example 6.1, above, each group of identical address sets is compared to the other groups of identical address sets to see if its common address set appears to be a prime address set. This is accomplished for each address set by seeing whether any other address set is a subset of it. For example, in FIG. 22 d, the address set common to group II is not a prime address set, because the address set common to group V is a subset of the address set common to group II. Similarly, group III is not prime, because group V is its subset, and both groups VI and VII are not prime, because group I is a subset of each of them. The remaining groups (namely, I, IV, and V) do not have subsets, and therefore appear to be comprised of prime address sets.

Each putative prime address set is then tested to see if it is indeed a prime set. This is accomplished by examining all the address sets that contain all of the oligonucleotides that are present in the putative prime address set. For example, in FIG. 23 a, all the address sets that contain all the oligonucleotides that are present in the putative prime address set common to group I are listed together (namely the address sets contained in groups I, VI, and VII). Similarly, FIG. 23 b lists all the address sets that contain all the oligonucleotides that are present in the putative prime address set common to group V; and FIG. 23 c lists all the address sets that contain all the oligonucleotides that are present in the putative prime address set common to group IV. Each of these three putative prime address sets is then tested to see if it is indeed a prime set. The address set common to group V (analyzed in FIG. 23 b) is indeed a prime set (and therefore contains a single strand set) because a list of those oligonucleotides that are found in every address set in the diagram is identical to the list of addresses on the left side of the diagram. The address set common to group I (analyzed in FIG. 23 a), however, is not a prime set (and therefore does not contain a single strand set) because a list of those oligonucleotides that are found in every address set in the diagram (namely, AGC, ATG, CGC, CTA, CTT, GCT, TAA, TAG, TGC, and TTA) is not identical to the list of addresses on the left side of the diagram (namely, AGC, CTA, CTT, GCT, TAA, and TAG). The address set that is common to group I is therefore a pseudo-prime address set. Similarly, the address set common to group IV (analyzed in FIG. 23 c) is also a pseudo-prime address set.

Pseudo-prime address sets are decomposed into strand sets by identifying the extra oligonucleotides that prevent the pseudo-prime address set from being a prime set. This is accomplished in the following manner: In the first step, a list is made of those oligonucleotides that are members of the pseudo-prime address set, but are not on the list of addresses whose address sets contain all the members of the pseudo-prime address set. For example, in FIG. 23 a, pseudo-prime address set A consists of oligonucleotides: AGC, ATG, CGC, CTA, CTT, GCT, TAA, TAG, TGC, and TTA. However, the list of addresses shown in bold letters on the left of the diagram does not include: ATG, CGC, and TGC. In the second step, the groups associated with these “missed” addresses are identified. For example, from FIG. 22 d, it can be seen that missed address oligonucleotides ATG and TGC belong to group VI, and missed address oligonucleotide CGC belongs to group IV. In the third step, new diagrams are prepared that include one or more of the “missed” groups. For example, FIG. 24 a is prepared by adding the address sets from group VI to the diagram from FIG. 23 a. Similarly, FIG. 24 b is prepared by adding the address set from group IV to the diagram from FIG. 23 a. The set of oligonucleotides that are contained in every address set of this new diagram (they are seen as full columns) represents a putative strand set. For example, in FIG. 24 a, the putative strand set consists of oligonucleotides AGC, ATG, CTA, CTT, GCT, TAA, TAG, TGC, and TTA. Similarly, in FIG. 24 b, the putative strand set consists of oligonucleotides AGC, CGC, CTA, CTT, GCT, TAA, TAG, and TTA. The final step is to test each putative strand set to see if it is indeed a strand set. This is accomplished by seeing if the list of addresses on the left of the diagram is identical to the list of oligonucleotides in the putative strand set. For example, putative strand set A1, analyzed in FIG. 24 a, is indeed a strand set, because the vertical list of nine addresses on the left of the diagram is identical to the list of nine oligonucleotides that are found in every one of the nine address sets. Similarly, putative strand set A2, analyzed in FIG. 24 b, is also a strand set.
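The four decomposition steps can be sketched as follows. This is a simplified illustration, not the full procedure: it adds only one “missed” group at a time, whereas the text allows one or more groups to be added to a diagram. The data are hypothetical toy address sets rather than those of FIGS. 22 to 24, and the function name decompose is an assumption.

```python
def decompose(candidate, address_sets):
    """Split a pseudo-prime address set into strand sets, adding one missed group at a time."""
    # Rows of the starting diagram: addresses whose address sets contain the candidate.
    rows = [a for a, s in address_sets.items() if candidate <= s]
    # Step 1: members of the candidate that are not among those addresses.
    missed = sorted(set(candidate) - set(rows))
    strand_sets = []
    for m in missed:
        # Step 2: the group of the missed address (all addresses with an identical address set).
        group = [a for a, s in address_sets.items() if s == address_sets[m]]
        # Step 3: new diagram = starting rows plus the missed group; full columns = putative strand set.
        new_rows = rows + [a for a in group if a not in rows]
        putative = set.intersection(*(set(address_sets[a]) for a in new_rows))
        # Step 4: accept only if the addresses equal the putative strand set; skip duplicates.
        if putative == set(new_rows) and putative not in strand_sets:
            strand_sets.append(putative)
    return missed, strand_sets


# Hypothetical address sets keyed by address (toy data, not from the figures).
address_sets = {
    "AGC": {"AGC", "CTA", "ATG", "CGC"},
    "CTA": {"AGC", "CTA", "ATG", "CGC"},
    "ATG": {"AGC", "CTA", "ATG", "GGT"},
    "CGC": {"AGC", "CTA", "CGC", "TAT"},
    "GGT": {"ATG", "GGT"},
    "TAT": {"CGC", "TAT"},
}

missed, strand_sets = decompose({"AGC", "CTA", "ATG", "CGC"}, address_sets)
print(missed)       # ['ATG', 'CGC']
print(strand_sets)  # two strand sets: {AGC, ATG, CTA} and {AGC, CGC, CTA}
```

The duplicate check in step 4 mirrors the observation that two different pseudo-prime address sets can yield the same strand set, which should then be counted only once.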

The decomposition of pseudo-prime address set C (identified in FIG. 23 c) into its constituent strand sets illustrates an interesting aspect of this method. Its decomposition, shown in FIGS. 24 c and 24 d, gives rise to two strand sets, labeled “C1” and “C2”. However, a comparison of all the strand sets identified indicates that strand set A2 is identical to strand set C2. Thus, there are four strands in the original mixture, represented by strand sets A1, A2, B, and C1.

The sequence of each of the four strands is then determined by: (a) assembling the oligonucleotides in the strand set into blocks, (b) filtering the indexed address sets to only include information that pertains to the oligonucleotides that are in the strand set, (c) converting the filtered address sets into block sets, (d) identifying the unique blocks (that only occur once in the sequence), (e) ascertaining the relative positions of the blocks from the information in the block sets, and (f) assembling the blocks into the nucleotide sequence of the strand by taking into account both the relative positions of the blocks and the sequences that occur at the termini of the blocks.

The power of this method is illustrated in FIGS. 25 to 28. For example, in the assembly of strand A1 (shown in FIG. 25), the top three block sets in FIG. 25 d identify three blocks that definitely occur only once in the sequence (namely, “ATGC”, “CTTA”, and “TAGC”), and these three block sets also indicate the relative order of the three blocks. In addition, these block sets indicate that both “CTA” and “TAA” can only occur downstream of “TAGC”, and that “GCT” can only occur downstream of “ATGC”. Inspection of the lower three block sets in FIG. 25 d shows that “GCT” occurs upstream of both “CTA” and “TAA”, and that “TAA” occurs downstream of “CTA”. The nucleotide sequence of Strand A1 is then assembled from a consideration of these positional constraints and a consideration of which blocks can maximally overlap each other. The gap between “ATGC” and “CTTA” is filled by “GCT”. The gap between “CTTA” and “TAGC” cannot be filled by “GCT”; instead, “CTTA” is joined directly to “TAGC”. The gap that occurs after “TAGC” can only be filled by joining “TAGC” to “GCT”, then joining “GCT” to “CTA”, and finally, joining “CTA” to “TAA” to complete the sequence.
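The final joining step for Strand A1 can be illustrated with a short Python sketch that merges blocks, in the order given above, at their largest terminal overlap. The helper names overlap and join_blocks are assumptions made for this illustration, and the sketch takes the block order as input rather than deriving it from the block sets. The resulting sequence is what follows from the joins described in the text; it is not quoted from FIG. 25.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def join_blocks(blocks):
    """Join blocks in the given order, merging each at its maximal terminal overlap."""
    sequence = blocks[0]
    for block in blocks[1:]:
        k = overlap(sequence, block)
        if k == 0:
            raise ValueError(f"{block!r} does not overlap the end of {sequence!r}")
        sequence += block[k:]  # append only the non-overlapping tail
    return sequence


# Block order for Strand A1 as described in the text above.
a1_blocks = ["ATGC", "GCT", "CTTA", "TAGC", "GCT", "CTA", "TAA"]
strand_a1 = join_blocks(a1_blocks)
print(strand_a1)  # ATGCTTAGCTAA
```

Every join in this example uses an overlap of two nucleotides, as expected for blocks built from trinucleotides that each advance one position along the strand.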

In the assembly of Strand A2 (shown in FIG. 26), a consideration of the information in the two unique block sets (“CTTA” and “TAGC”) indicates that: “CTTA” is upstream of “TAGC”, “CGC” is upstream of “CTTA”, both “CTA” and “TAA” are downstream of “TAGC”, and “GCT” can occur at any position. It is easy to see that “GCT” occurs twice in the sequence, once to join “CGC” and “CTTA”, and once again to join “TAGC” and “CTA”. Although there is a gap between “CTTA” and “TAGC”, it cannot be filled by “GCT”, and the gap is filled by joining “CTTA” directly to “TAGC”. The sequence is completed by joining “CTA” to “TAA”.
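An assembled sequence can be checked by breaking it back into trinucleotides and comparing them with the strand set. The sketch below does this for Strand A2. The sequence CGCTTAGCTAA is inferred here by following the joins described above (it is not quoted from FIG. 26), the strand set is the list given for FIG. 24 b, and the function name trinucleotides is an assumption.

```python
from collections import Counter


def trinucleotides(sequence, k=3):
    """Count every overlapping k-nucleotide segment of a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Sequence inferred from the joins in the text: CGC, GCT, CTTA, TAGC, GCT, CTA, TAA.
strand_a2 = "CGCTTAGCTAA"
# Strand set A2 as listed in the text for FIG. 24 b.
strand_set_a2 = {"AGC", "CGC", "CTA", "CTT", "GCT", "TAA", "TAG", "TTA"}

counts = trinucleotides(strand_a2)
print(set(counts) == strand_set_a2)  # True
print(counts["GCT"])                 # 2
```

The count of two for GCT agrees with the statement that this oligonucleotide occurs twice in the sequence, while every other member of the strand set occurs once.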

In the assembly of strand B (shown in FIG. 27), “TGCTG” occurs upstream of “TGGTA”. “ATG” occurs upstream of “TGCTG”, and “TAT”, “ATA”, and “TAA” occur downstream of “TGGTA”. It is easy to see that “ATG” is joined to “TGCTG”, and “TGCTG” is joined to “TGGTA”. It is also seen that “ATA” and “TAA” occur downstream of “TAT”, and that “TAA” occurs downstream of “ATA”. From a consideration of positional information and from a consideration of the sequence of the blocks, it follows that the only permissible way to fill in the gap that occurs downstream of “TGGTA” is to join “TGGTA” to “TAT”, join “TAT” to “ATA”, and then join “ATA” to “TAA”, thus completing the sequence.

The assembly of strand C1 (shown in FIG. 28) is straightforward. There are two definitely unique sequence blocks (“CGCTTA” and “TATA”), and their order is known from their block sets (“CGCTTA” is upstream of “TATA”). The third block, “TAA”, occurs downstream of both unique blocks. The sequence of Strand C1 is determined by joining “CGCTTA” to “TATA”, and then joining “TATA” to “TAA”.

1-159. (canceled)
 160. A method of analyzing a nucleic acid, comprising: providing at least one oligonucleotide which is complementary to a target sequence of interest in a genomic DNA sample; amplifying a mixture of nucleic acids comprising a group of genome fragments by a method comprising: cleaving a genomic DNA sample with a restriction enzyme, thereby providing restriction fragments; ligating adaptor nucleic acids to the restriction fragments, thereby providing adaptor-ligated fragments; hybridizing the adaptor-ligated fragments to immobilized oligonucleotides that are complementary to the adaptor nucleic acids wherein the immobilized oligonucleotides are attached to a solid support, and extending the hybridized immobilized oligonucleotides using the adaptor-ligated fragments as template, thereby providing extended immobilized oligonucleotides; and amplifying the extended immobilized oligonucleotides, thereby providing an amplified nucleic acid mixture comprising genome fragments; and hybridizing the at least one oligonucleotide to the amplified nucleic acid mixture, thereby analyzing at least one nucleic acid of interest in the amplified mixture.
 161. The method of claim 160, wherein the oligonucleotide is a member of an array of oligonucleotides, which array comprises additional oligonucleotides which hybridize to different target sequences of interest.
 162. A method of analyzing at least one nucleic acid, comprising: obtaining a plurality of amplified genomic fragments separated into discrete features of an array by a method comprising: (a) fragmenting a genomic DNA sample comprising at least one nucleic acid, thereby providing fragments; (b) ligating an adaptor to the fragments to generate adaptor-ligated fragments, wherein said adaptor comprises a universal priming sequence; (c) providing an oligonucleotide array comprising oligonucleotides that are complementary to the universal priming sequence in the adaptor, wherein the oligonucleotides are attached to a solid support; (d) hybridizing the adaptor-ligated fragments to the oligonucleotides on the solid support so that fragments of different sequence are hybridized at different discrete locations of the solid support; (e) amplifying the adaptor-ligated fragments by extending the oligonucleotides using a DNA polymerase to obtain immobilized extended polynucleotides of different sequences and amplifying the extended immobilized polynucleotides, thereby providing an array of amplified genomic fragments of different sequences present in different discrete locations of the array; and (f) analyzing at least one of the amplified genomic fragments.
 163. The method of claim 162 wherein the step of fragmenting a genomic DNA sample comprises fragmentation with a restriction endonuclease.
 164. The method of claim 162 wherein the array of oligonucleotides is an array of regularly situated areas on a solid support, wherein different oligonucleotides are immobilized by covalent linkage.
 165. The method of claim 164 wherein each oligonucleotide comprises a common region and a variable region.
 166. The method of claim 164 wherein the variable regions vary in sequence or length.
 167. The method of claim 163 wherein the step of ligating an adaptor to the fragments restores a recognition site for the restriction endonuclease.
 168. The method of claim 160 wherein the adaptor sequence is appended to both ends of the fragments.
 169. The method of claim 160 wherein prior to amplifying the extended immobilized oligonucleotides, the solid support is washed to remove non-covalently bound materials from the solid support.
 170. The method of claim 160 wherein said oligonucleotides are attached to the solid support at the 5′ ends of the oligonucleotides.
 171. A method of analyzing a plurality of different nucleic acid sequences in a complex nucleic acid sequence comprising: (a) fragmenting the complex nucleic acid sample to obtain a plurality of different sequence nucleic acid fragments; (b) ligating a first adaptor sequence to the 5′ ends of the fragments and a second adaptor sequence to the 3′ ends of the fragments, to obtain a plurality of different sequence, adaptor-ligated fragments; (c) hybridizing the adaptor-ligated fragments to an array of oligonucleotides attached to a solid support wherein the oligonucleotides are attached to the solid support at the 5′ end and have a free 3′ end, and wherein the oligonucleotides comprise a sequence that is complementary to the second adaptor sequence; (d) extending the oligonucleotides with a polymerase using the adaptor-ligated fragments as template to obtain extended oligonucleotides that comprise at their 3′ ends the complement of the first adaptor sequence; (e) amplifying the extended oligonucleotides to obtain a plurality of different nucleic acid sequences by hybridizing a primer to the extended oligonucleotides, wherein the primer is complementary to the complement of the first adaptor sequence and extending the primer to obtain a copy of said extended oligonucleotides; and amplifying the copy of the extended oligonucleotide; and (f) analyzing the plurality of different nucleic acid sequences.
 172. The method of claim 171 wherein step (e) comprises extending said primers in the presence of a labeled nucleotide.
 173. The method of claim 172 wherein said labeled nucleotide is a dideoxynucleotide.